Just a Theory

By David E. Wheeler

Please Test Pod::Simple 3.29_3

Pod Book

I’ve just pushed Pod-Simple 3.29_v3 to CPAN. Karl Williamson did a lot of hacking on this release, finally adding support for EBCDIC. But as part of that work, and in coordination with Pod::Simple’s original author, Sean Burke, as well as pod-people, we have switched the default encoding from Latin-1 to CP-1252.

On the surface, that might sound like a big change, but in truth, it’s pretty straight-forward. CP-1252 is effectively a superset of Latin-1, repurposing 30 or so unused control characters from Latin-1. Those characters are pretty common on Windows (the home of the CP family of encodings), especially in pastes from Word. It’s nice to be able to pick those up essentially for free.

Still, Karl’s done more than that. He also updated the encoding detection to do a better job at detecting UTF-8. This is the real default. Pod::Simple only falls back on CP1252 if there are no obvious UTF-8 byte sequences in your Pod.

Overall these changes should be a great improvement. Better encoding support is always a good idea. But it is a pretty significant change, including a change to the Pod spec. Hence the test release. Please make sure it works well with your code by installing it today:

cpan D/DW/DWHEELER/Pod-Simple-3.29_3.tar.gz
cpanm DWHEELER/Pod-Simple-3.29_3.tar.gz

Oh, and one last thing: If Pod::Simple fails to properly recognize the encoding in your Pod file, you can always use the =encoding command early in your Pod file to make it explicit:

=encoding CP1254

Build Modern Perl RPMs with rpmcpan

iovation + Perl = Love

We’ve been using the CentOS Perl RPMs at iovation to run all of our Perl applications. This has been somewhat painful, because the version of Perl, 5.10.1, is quite old — it shipped in August 2009. In fact, it consists mostly of bug fixes against Perl 5.10.0, which shipped in December 2007! Many of the modules provided by CentOS core and EPEL are quite old, as well, and we had built up quite the collection of customized module RPMs managed by a massive spaghetti-coded Jenkins job. When we recently ran into a Unicode issue that would best have been addressed by running a more modern Perl — rather than a hinky workaround — I finally sat down and knocked out a way to get a solid set of Modern Perl and related CPAN RPMs.

I gave it the rather boring name rpmcpan, and now you can use it, too. Turns out, DevOps doesn’t myopically insist on using core RPMs in the name of some abstract idea about stability. Rather, we just need a way to easily deploy our stuff as RPMs. If the same applies to your organization, you can get Modern Perl RPMs, too.

Here’s how we do it. We have a new Jenkins job that runs both nightly and whenever the rpmcpan Git repository updates. It uses the MetaCPAN API to build the latest versions of everything we need. Here’s how to get it to build the latest version of Perl, 5.20.1:

./bin/rpmcpan --version 5.20.1

That will get you a nice, modern Perl RPM, named perl520, completely encapsulated in /usr/local/perl520. Want 5.18 instead: Just change the version:

./bin/rpmcpan --version 5.18.2

That will give you perl518. But that’s not all. You want to build CPAN distributions against that version. Easy. Just edit the dists.json file. Its contents are a JSON object where the keys name CPAN distributions (not modules), and the values are objects that customize our RPMs get built. Most of the time, the objects can be empty:

{
    "Try-Tiny": {}
}

This results in an RPM named perl520-Try-Tiny (or perl518-Try-Tiny, etc.). Sometimes you might need additional information to customize the CPAN spec file generated to build the distribution. For example, since this is Linux, we need to exclude a Win32 dependency in the Encode-Locale distribution:

{
    "Encode-Locale": { "exclude_requires": ["Win32::Console"] }
}

Other distributions might require additional RPMs or environment variables, like DBD-Pg, which requires the PostgreSQL RPMs:

{
    "DBD-Pg": {
        "build_requires": ["postgresql93-devel", "postgresql93"],
        "environment": { "POSTGRES_HOME": "/usr/pgsql-9.3" }
    }
}

See the README for a complete list of customization options. Or just get started with our dists.json file, which so far builds the bare minimum we need for one of our Perl apps. Add new distributions? Send a pull request! We’ll be doing so as we integrate more of our Perl apps with a Modern Perl and leave the sad RPM past behind.

More about…

Sqitch on FLOSS Weekly

Yours truly was feature in this week’s episode of FLOSS Weekly, talking about Sqitch. I feel pretty good about this interview, despite continually banging on my legs, the table, and the mic. It’s interesting to try to communicate what Sqitch is about purely by talking.

If it’s enough to get you interested in giving a try, try installing it and using working through one of the tutorials:

More about…

Localize Your Perl Apps with this One Weird Trick

Nota Bene: This is a republication of a [post that originally appeared in the 2013 Perl Advent Calendar.


These days, gettext is far and away the most widely-used localization (l10n) and internationalization (i18n) library for open-source software. So far, it has not been widely used in the Perl community, even though it’s the most flexible, capable, and easy-to use solution, thanks to Locale::TextDomain.1 How easy? Let’s get started!

Module Internationale

First, just use Locale::TextDomain. Say you’re creating an awesome new module, Awesome::Module. These CPAN distribution will be named Awesome-Module, so that’s the “domain” to use for its localizations. Just let Locale::TextDomain know:

use Locale::TextDomain 'Awesome-Module';

Locale::TextDomain will later use this string to look for the appropriate translation catalogs. But don’t worry about that just yet. Instead, start using it to translate user-visible strings in your code. With the assistance of the Locale::TextDomain’s [comprehensive documentation], you’ll find it second nature to internationalize your modules in no time. For example, simple strings are denoted with __:

say __ 'Greetings puny human!';

If you need to specify variables, use __x:

say __x(
   'Thank you {sir}, may I have another?',
   sir => $username,
);

Need to manage plurals? Use __n:

say __n(
    'I will not buy this record, it is scratched.',
    'I will not buy these records, they are scratched.',
    $num_records
);

If $num_records is 1, the first phrase will be used. Otherwise the second.

Sometimes you gotta do both, mix variables and plurals. __nx has got you covered there:

say __nx(
    'One item has been grokked.',
    '{count} items have been grokked.',
    $num_items,
    count => $num_items,
);

Congratulations! Your module is now internationalized. Wasn’t that easy? Make a habit of using these functions in all the modules in your distribution, always with the Awesome-Module domain, and you’ll be set.

Encode da Code

Locale::TextDomain is great, but it dates from a time when Perl character encoding was, shall we say, sub-optimal. It therefore took it upon itself to try to do the right thing, which is to to detect the locale from the runtime environment and automatically encode as appropriate. Which might work okay if all you ever do is print localized messages — and never anything else.

If, on the other hand, you will be manipulating localized strings in your code, or emitting unlocalized text (such as that provided by the user or read from a database), then it’s probably best to coerce Locale::TextDomain to return Perl strings, rather than encoded bytes. There’s no formal interface for this in Locale::TextDomain, so we have to hack it a bit: set the $OUTPUT_CHARSET environment variable to “UTF-8” and then bind a filter. Don’t know what that means? Me neither. Just put this code somewhere in your distribution where it will always run early, before anything gets localized:

use Locale::Messages qw(bind_textdomain_filter);
use Encode;
BEGIN {
    $ENV{OUTPUT_CHARSET} = 'UTF-8';
    bind_textdomain_filter 'Awesome-Module' => \&Encode::decode_utf8;
}

You only have to do this once per domain. So even if you use Locale::TextDomain with the Awesome-Module domain in a bunch of your modules, the presence of this code in a single early-loading module ensures that strings will always be returned as Perl strings by the localization functions.

Environmental Safety

So what about output? There’s one more bit of boilerplate you’ll need to throw in. Or rather, put this into the main package that uses your modules to begin with, such as the command-line script the user invokes to run an application.

First, on the shebang line, follow Tom Christiansen’s advice and put -CAS in it (or set the $PERL_UNICODE environment variable to AS). Then use the POSIX setlocale function to the appropriate locale for the runtime environment. How? Like this:

#!/usr/bin/perl -CAS

use v5.12;
use warnings;
use utf8;
use POSIX qw(setlocale);
BEGIN {
    if ($^O eq 'MSWin32') {
        require Win32::Locale;
        setlocale POSIX::LC_ALL, Win32::Locale::get_locale();
    } else {
        setlocale POSIX::LC_ALL, '';
    }
}

use Awesome::Module;

Locale::TextDomain will notice the locale and select the appropriate translation catalog at runtime.

Is that All There Is?

Now what? Well, you could do nothing. Ship your code and those internationalized phrases will be handled just like any other string in your code.

But what’s the point of that? The real goal is to get these things translated. There are two parts to that process:

  1. Parsing the internationalized strings from your modules and creating language-specific translation catalogs, or “PO files”, for translators to edit. These catalogs should be maintained in your source code repository.

  2. Compiling the PO files into binary files, or “MO files”, and distributing them with your modules. These files should not be maintained in your source code repository.

Until a year ago, there was no Perl-native way to manage these processes. Locale::TextDomain ships with a sample Makefile demonstrating the appropriate use of the GNU gettext command-line tools, but that seemed a steep price for a Perl hacker to pay.

A better fit for the Perl hacker’s brain, I thought, is Dist::Zilla. So I wrote Dist::Zilla::LocaleTextDomain to encapsulate the use of the gettext utiltiies. Here’s how it works.

First, configuring Dist::Zilla to compile localization catalogs for distribution: add these lines to your dist.ini file:

[ShareDir]
[LocaleTextDomain]

There are configuration attributes for the LocaleTextDomain plugin, such as where to find the PO files and where to put the compiled MO files. In case you didn’t use your distribution name as your localization domain in your modules, for example:

use Locale::TextDomain 'com.example.perl-libawesome';

Then you’d set the textdomain attribute so that the LocaleTextDomain plugin can find the translation catalogs:

[LocaleTextDomain]
textdomain = com.example.perl-libawesome

Check out the configuration docs for details on all available attributes.

At this point, the plugin doesn’t do much, because there are no translation catalogs yet. You might see this line from dzil build, though:

[LocaleTextDomain] Skipping language compilation: directory po does not exist

Let’s give it something to do!

Locale Motion

To add a French translation file, use the msg-init command2:

% dzil msg-init fr
Created po/fr.po.

The msg-init command uses the GNU gettext utilities to scan your Perl source code and initialize the French catalog, po/fr.po. This file is now ready translation! Commit it into your source code repository so your agile-minded French-speaking friends can find it. Use msg-init to create as many language files as you like:

% dzil msg-init de ja.JIS en_US.UTF-8 en_UK.UTF-8
Created po/de.po.
Created po/ja.po.
Created po/en_US.po.
Created po/en_UK.po.

Each language has its on PO file. You can even have region-specific catalogs, such as the en_US and en_UK variants here. Each time a catalog is updated, the changes should be committed to the repository, like code. This allows the latest translations to always be available for compilation and distribution. The output from dzil build now looks something like:

po/fr.po: 10 translated messages, 1 fuzzy translation, 0 untranslated messages.
po/ja.po: 10 translated messages, 1 fuzzy translation, 0 untranslated messages.
po/en_US.po: 10 translated messages, 1 fuzzy translation, 0 untranslated messages.
po/en_UK.po: 10 translated messages, 1 fuzzy translation, 0 untranslated messages.

The resulting MO files will be in the shared directory of your distribution:

% find Awesome-Module-0.01/share -type f
Awesome-Module-0.01/share/LocaleData/de/LC_MESSAGES/Awesome-Module.mo
Awesome-Module-0.01/share/LocaleData/en_UK/LC_MESSAGES/Awesome-Module.mo
Awesome-Module-0.01/share/LocaleData/en_US/LC_MESSAGES/Awesome-Module.mo
Awesome-Module-0.01/share/LocaleData/ja/LC_MESSAGES/Awesome-Module.mo

From here Module::Build or ExtUtils::MakeMaker will install these MO files with the rest of your distribution, right where Locale::TextDomain can find them at runtime. The PO files, on the other hand, won’t be used at all, so you might as well exclude them from the distribution. Add this line to your MANIFEST.SKIP to prevent the po directory and its contents from being included in the distribution:

^po/

Mergers and Acquisitions

Of course no code base is static. In all likelihood, you’ll change your code — and end up adding, editing, and removing localizable strings as a result. You’ll need to periodically merge these changes into all of your translation catalogs so that your translators can make the corresponding updates. That’s what the the msg-merge command is for:

% dzil msg-merge
extracting gettext strings
Merging gettext strings into po/de.po
Merging gettext strings into po/en_UK.po
Merging gettext strings into po/en_US.po
Merging gettext strings into po/ja.po

This command re-scans your Perl code and updates all of the language files. Old messages will be commented-out and new ones added. Commit the changes and give your translators a holler so they can keep the awesome going.

Template Scan

The msg-init and msg-merge commands don’t actually scan your source code. Sort of lied about that. Sorry. What they actually do is merge a template file into the appropriate catalog files. If this template file does not already exist, a temporary one will be created and discarded when the initialization or merging is done.

But projects commonly maintain a permanent template file, stored in the source code repository along with the translation catalogs. For this purpose, we have the msg-scan command. Use it to create or update the template, or POT file:

% dzil msg-scan
extracting gettext strings into po/Awesome-Module.pot

From here on in, the resulting .pot file will be used by msg-init and msg-merge instead of scanning your code all over again. But keep in mind that, if you do maintain a POT file, future merges will be a two-step process: First run msg-scan to update the POT file, then msg-merge to merge its changes into the PO files:

% dzil msg-scan
extracting gettext strings into po/Awesome-Module.pot
% dzil msg-merge
Merging gettext strings into po/de.po
Merging gettext strings into po/en_UK.po
Merging gettext strings into po/en_US.po
Merging gettext strings into po/ja.po

Lost in Translation

One more thing, a note for translators. They can, of course, also use msg-scan and msg-merge to update the catalogs they’re working on. But how do they test their translations? Easy: use the msg-compile command to compile a single catalog:

% dzil msg-compile po/fr.po
[LocaleTextDomain] po/fr.po: 195 translated messages.

The resulting compiled catalog will be saved to the LocaleData subdirectory of the current directory, so it’s easily available to your app for testing. Just be sure to tell Perl to include the current directory in the search path, and set the $LANGUAGE environment variable for your language. For example, here’s how I test the [Sqitch] French catalog:

% dzil msg-compile po/fr.po              
[LocaleTextDomain] po/fr.po: 148 translated messages, 36 fuzzy translations, 27 untranslated messages.
% LANGUAGE=fr perl -Ilib -CAS -I. bin/sqitch foo
"foo" n'est pas une commande valide

Just be sure to delete the LocaleData directory when you’re done — or at least don’t commit it to the repository.

TL;DR

This may seem like a lot of steps, and it is. But once you have the basics in place — Configuring the Dist::Zilla::LocaleTextDomain plugin, setting up the “textdomain filter”, setting and the locale in the application — there are just a few habits to get into:

  • Use the functions __, __x, __n, and __nx to internationalize user-visible strings
  • Run msg-scan and msg-merge to keep the catalogs up-to-date
  • Keep your translators in the loop.

The Dist::Zilla::LocaleTextDomain plugin will do the rest.


  1. What about Locale::Maketext, you ask? It has not, alas, withsthood the test of time. For details, see Nikolai Prokoschenko’s epic 2009 polemic, “On the state of i18n in Perl.” See also Steffen Winkler’s presentation, Internationalisierungs-Framework auswählen (and the English translation by Aristotle Pagaltzis), from German Perl Workshop 2010.

  2. The msg-init function — like all of the dzil msg-* commands — uses the GNU gettext utilities under the hood. You’ll need a reasonably modern version in your path, or else it won’t work.

Sqitch Goes Vertical

I released Sqitch v0.996 today. Despite the minor version increase, this is a pretty big release. I’m busy knocking out all the stuff I want to get done for 1.0, but the version space is running out, so just a minor version jump from v0.995 to v0.996. But a lot changed. A couple the biggies:

Goodbye Mouse and Moose, Hello Moo

If you’re not a Perl programmer, you probably aren’t familiar with Moose or its derivatives Mouse and Moo. Briefly, it’s an object system. Great interface and features, but freaking *huge*—and slow. Mouse is a lighter version, and when we (mostly) switched to it last year, it yielded a 20-30% speed improvement.

Still wasn’t great, though. So on a day off recently, I switched to Moo, which implements most of Moose but without a lot of the baggage. At first, there wasn’t much difference in performance, but as I profiled it (Devel::NYTProf is indispensable for profiling Perl apps, BTW), I was able to root out all trace of Moose or Mouse, including in CPAN modules Sqitch depends on. The result is around a 40% speedup over what we had before. Honestly, it feels like a new app, it’s so fast. I’m really happy with how it turned out, and to have shed some of the baggage from the code base.

The downside is that package maintainers will need to do some work to get the new dependencies built. Have a look at the RPM spec changes I made to get our internal Sqitch RPMs to build v0.996.

MySQL Password Handling

The handling of MySQL passwords has also been improved. Sqitch now uses the $MYSQL_PWD environment variable if a password is provided in a target. This should simplify authentication when running MySQL change scripts through the mysql client client.

Furthermore, if MySQL::Config is installed, Sqitch will look for passwords in the client and mysql sections of your MySQL configuration files (~/.my.cnf, /etc/my.cnf). This should already happen automatically when executing scripts, but Sqitch now tries to replicate that behavior when connecting to the database via DBI.

Spotting the $MYSQL_PWD commit, Ștefan Suciu updated the Firebird engine to use the $ISC_PASSWORD when running scripts. Awesome.

Vertically Integrated

And finally, another big change: I added support for Vertica, a very nice commercial column-store database that features partitioning and sharding, among other OLAP-style functionality. It was originally forked from PostgreSQL, so it was fairly straight-forward to port, though I did have to borrow a bit from the Oracle and SQLite engines, too. This port was essential for work, as we’re starting to use Vertical more and more, and need ways to manage changes.

If you’re using Vertica, peruse the tutorial to get a feel for what it’s all about. If you want to install it, you can get it from CPAN:

cpan install App::Sqitch BDD::ODBC

Or, if you’re on Homebrew:

brew tap theory/sqitch
brew install sqitch_vertica

Be warned that there’s a minor bug in v0.996, though. Apply this diff to fix it:

 @@ -16,7 +16,7 @@ our $VERSION = '0.996';
 
 sub key    { 'vertica' }
 sub name   { 'Vertica' }
-sub driver { 'DBD::Pg 2.0' }
+sub driver { 'DBD::ODBC 1.43' }
 sub default_client { 'vsql' }
 
 has '+destination' => (

That fix will be in the next release, of course, as will support for Vertica 6.

What Next?

I need to focus on some other work stuff for a few weeks, but then I expect to come back to Sqitch again. I’d like to get 1.0 shipped before the end of the year. To that end, next up I will be rationalizing configuration hierarchies to make engine selection and deploy-time configuration more sensible. I hope to get that done by early October.

More about…

Managing Sqitch with Make

chromatic:

This saves me a few dozen keystrokes and a few seconds every time I make a database change. If that sounds trivial to you, good. A few keystrokes and a few seconds are trivial. My brainpower isn’t trivial. Those keystrokes and seconds mean the difference between staying in the zone and fumbling around trying to remember commands I don’t use all day every day. They save me minutes every time I use them, if you count the friction of switching between “How do I do this in Sqitch again? What’s the directory layout here?” and “What was I really working on?”

Nice application of a Makefile to eliminate boilerplate. A couple of notes, though:

Nice post. A couple comments and questions:

  • As of Sqitch v0.990, you can pass the --open-editor option to the add command to have the new files opened in your editor.

  • If you want to add a pgTAP test with a new change, see this post.

  • What is the call to sqitch status for? Since its output just goes to /dev/null, I don’t understand the point.

  • Also as of v0.990, you can specify Sqitch targets. The -d, -u, and other options then override values in the target URI.

  • I really want to get Sqitch to better understand and work with VCSs. An example would be to have it automatically git add files created by sqitch add. Another might be a Git config setting pointing to the Sqitch config file. Alas, I don’t know when I will have the tuits to work on that.

Lots of room for growth and improvement in Sqitch going forward. You post provides more food for thought.

More about…

Templating Tests with Sqitch

Back in September, I described how to create custom deploy, revert, and verify scripts for various types of Sqitch changes, such as adding a new table. Which is cool and all, but what I’ve found while developing databases at work is that I nearly always want to create a test script with the same name as a newly-added change.

So for the recent v0.990 release, the add command gained the ability to generate arbitrary script files from templates. To get it to work, we just have to create template files. Templates can go into ~/.sqitch/templates (for personal use) or in $(sqitch --etc-path)/templates (for use by everyone on a system). The latter is where templates are installed by default. Here’s what it looks like:

> ls $(sqitch --etc-path)/templates
deploy  revert  verify
> ls $(sqitch --etc-path)/templates/deploy
firebird.tmpl  mysql.tmpl  oracle.tmpl  pg.tmpl  sqlite.tmpl

Each directory defines the type of script and the name of the directory in which it will be created. The contents are default templates, one for each engine.

To create a default test template, all we have to do is create a template for our preferred engine in a directory named test. So I created ~/.sqitch/templates/test/pg.tmpl. Here it is:

SET client_min_messages TO warning;
CREATE EXTENSION IF NOT EXISTS pgtap;
RESET client_min_messages;

BEGIN;
SELECT no_plan();
-- SELECT plan(1);

SELECT pass('Test [% change %]!');

SELECT finish();
ROLLBACK;

This is my standard boilerplate for tests, more or less. It just loads pgTAP, sets the plan, runs the tests, finishes and rolls back. See this template in action:

> sqitch add whatever -n 'Adds whatever.'
Created deploy/whatever.sql
Created revert/whatever.sql
Created test/whatever.sql
Created verify/whatever.sql
Added "whatever" to sqitch.plan

Cool, it added the test script. Here’s what it looks like:

SET client_min_messages TO warning;
CREATE EXTENSION IF NOT EXISTS pgtap;
RESET client_min_messages;

BEGIN;
SELECT no_plan();
-- SELECT plan(1);

SELECT pass('Test whatever!');

SELECT finish();
ROLLBACK;

Note that it replaced the change variable in the call to pass(). All ready to start writing tests! Nice, right? If we don’t want the test script created – for example when adding a column to a table for which a test already exists – we use the --without option to omit it:

> sqitch add add_timestamp_column --without test -n 'Adds whatever.'
Created deploy/add_timestamp_column.sql
Created revert/add_timestamp_column.sql
Created verify/add_timestamp_column.sql
Added "add_timestamp_column" to sqitch.plan

Naturally you’ll want to update the existing test to validate the new column.

In the previous templating post, we added custom scripts as for CREATE TABLE changes; now we can add a test template, too. This one takes advantage of the advanced features of Template Toolkit. We name it ~/.sqitch/templates/test/createtable.tmpl to complement the deploy, revert, and verify scripts created previously:

-- Test [% change %]
SET client_min_messages TO warning;
CREATE EXTENSION IF NOT EXISTS pgtap;
RESET client_min_messages;

BEGIN;
SELECT no_plan();
-- SELECT plan(1);

SET search_path TO [% IF schema %][% schema %],[% END %]public;

SELECT has_table('[% table or change %]');
SELECT has_pk( '[% table or change %]' );

[% FOREACH col IN column -%]
SELECT has_column(        '[% table or change %]', '[% col %]' );
SELECT col_type_is(       '[% table or change %]', '[% col %]', '[% type.item( loop.index ) or 'text' %]' );
SELECT col_not_null(      '[% table or change %]', '[% col %]' );
SELECT col_hasnt_default( '[% table or change %]', '[% col %]' );

[% END %]
SELECT finish();
ROLLBACK;

As before, we tell the add command to use the createtable templates:

> sqitch add corp_widgets --template createtable \
  -s schema=corp -s table=widgets \
  -s column=id -s type=SERIAL \
  -s column=name -s type=TEXT \
  -s column=quantity -s type=INTEGER \
  -n 'Add corp.widgets table.'

This yields a very nice test script to get you going:

-- Test corp_widgets
SET client_min_messages TO warning;
CREATE EXTENSION IF NOT EXISTS pgtap;
RESET client_min_messages;

BEGIN;
SELECT no_plan();
-- SELECT plan(1);

SET search_path TO corp,public;

SELECT has_table('widgets');
SELECT has_pk( 'widgets' );

SELECT has_column(        'widgets', 'id' );
SELECT col_type_is(       'widgets', 'id', 'SERIAL' );
SELECT col_not_null(      'widgets', 'id' );
SELECT col_hasnt_default( 'widgets', 'id' );

SELECT has_column(        'widgets', 'name' );
SELECT col_type_is(       'widgets', 'name', 'TEXT' );
SELECT col_not_null(      'widgets', 'name' );
SELECT col_hasnt_default( 'widgets', 'name' );

SELECT has_column(        'widgets', 'quantity' );
SELECT col_type_is(       'widgets', 'quantity', 'INTEGER' );
SELECT col_not_null(      'widgets', 'quantity' );
SELECT col_hasnt_default( 'widgets', 'quantity' );


SELECT finish();
ROLLBACK;

I don’t know about you, but I’ll be using this functionality a lot.

More about…

Brent Simmons is Not Wrong

Brent Simmons:

Database people are already gasping for air, because they know what’s coming. Instead of creating a separate table for attachment metadata, I created an attachments column in the notes table and just encoded the attachment metadata there.

On iOS it uses Core Data’s built-in object archiving feature. On the server it’s stored as JSON.

This is wrong, surely; it’s not how to do this. Except, in this case, it is. Incomplete object graphs are wrong; inefficient and slower syncing with more complex server-side code is also wrong.

This is less wrong than the alternatives.

Some database folks might be gasping for air, but not those of us steeped in relational theory. In Database in Depth, relational theorist C.J. Date poses a question:

“Database in Depth,” by C.J. Date

In Chapter 1, I said that 1NF meant that every tuple in every relation contains just a single value (of the appropriate type, of course) in every attribute position—and it’s usual to add that those “single values” are supposed to be atomic. But this latter requirement raises the obvious question: what does it mean for data to be atomic?

Well, on page 6 of the book mentioned earlier, Codd defines atomic data as data that “cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions).” But even if we ignore that parenthetical exclusion, this definition is a trifle puzzling, and not very precise. For example, what about character strings? Are character strings atomic? Every product I know provides several operators on such strings—LIKE, SUBSTR (substring), “||” (concatenate), and so on—that clearly rely on the fact that character strings in general can be decomposed by the DBMS. So are those strings atomic? What do you think?

The whole book is worth a read, especially the first four chapters, as it does an excellent job of dispelling the myth that complex data types are verboten in a properly normalized relational model. Another gem:

But I could have used any number of different examples to make my point: I could have shown attributes (and therefore domains) that contained arrays; or bags; or lists; or photographs; or audio or video recordings; or X rays; or fingerprints; or XML documents; or any other kind of value, “atomic” or “nonatomic,” that you might care to think of. Attributes, and therefore domains, can contain anything (any values, that is). All of which goes a long way, incidentally, toward explaining why a true “object/relational” system would be nothing more nor less than a true relational system—which is to say, a system that supports the relational model, with all that such support entails.

This is exactly why PostgreSQL offers array, XML, and JSON data types: because sometimes, they are exactly what you need to properly model your system.

So you go, Brent, you’re doing it exactly right.

Sqitch on Target

At the end of the day last week, I released Sqitch v0.990. This was a pretty big release, with lots of changes. The most awesome addition, in my opinion, is Named Deployment targets.

In previous versions of Sqitch, one could set default values for the database to deploy to, but needed to use the --db-* options to deploy to another database. This was fine for development: just set the default on localhost and go. But when it came time to deploy to other servers for testing, QA, or production, it was a bit of a PITA. At work, I ended up writing deployment docs that defined a slew of environment variables, and our operations team needed to adjust those variables to deploy to various servers. It was ugly, and frankly a bit of a pain in the ass.

I thought it’d be better to have named deployment targets, so instead of changing a bunch of environment variables in order to set a bunch of options, we could just name a target and go. I borrowed the idea from Git remotes, and started a database URI spec (mentioned previously) to simplify things a bit. Here’s how it works. Say you have a PostgreSQL Sqitch project called “Flipr”. While doing development, you’ll want to have a local database to deploy to. There is also a QA database and a production database. Use the target command to set them up:

sqitch target add dev db:pg:flipr
sqitch target add qa db:pg://sqitch@qa.example.com/flipr
sqitch target add prod db:pg://sqitch@db.example.com/flipr

Like Git remotes, we just have names and URIs. To deploy to a database, just name it:

sqitch deploy dev

Want to deploy to QA? Name it:

sqitch deploy qa

This works with any of the commands that connect to a database, including revert and status:

sqitch revert --to @HEAD^^ dev
sqitch status prod

The great thing about this feature is that the configuration is all stored in the project Sqitch configuration file. That means you can commit all the connection URIs for all likely targets in directly to the project repository. If they change, just change them in the config, commit, and push.

Targets don’t always have to be configured in advance, of course. The names essentially stand in for the URIs, so you can connect to an unnamed target just by using a URI:

sqitch log db:postgres://db1.example.net/flipr_export

Of course there are still defaults specific to each engine. I generally like to set the “dev” target as the default deploy target, like so:

sqitch config core.pg.target dev

This sets the “dev” target as the default for the PostgreSQL engine. So now I can do stuff with the “dev” target without mentioning it at all:

sqitch rebase --onto HEAD^4

Named targets may also have a couple other attributes associated with them:

  • client: The command-line client to use for a target.
  • registry: The name of the Sqitch registry schema or database, which defaults to, simply, sqitch.

Now that I’ve started using it, I can think of other things I’d like to add to targets in the future, including:

Pretty cool stuff ahead, IMO. I’m grateful to work for letting me hack on Sqitch.

More about…

Toward A Database URI Standard

As part of my effort to improve Sqitch, I plan to add support for specifying deployment targets via URIs. Inspired by Git remotes, targets will greatly simplify the specification of databases to update — especially when stored as named targets in the configuration file.

Before implementing it, though, I started casting about for a standard URI Scheme for database connections. Imagine my surprise1 to find that there is none! The closest thing to a standard is JDBC URLs. Formally, their format is simply:

jdbc:<jdbc-specific-stuff>

Turns out that JDBC URLs are barely URLs at all. I mean, fine, according to RFC 3986 they start with the jdbc: scheme followed by whatever. According to the JDBC docs, what comes after the scheme is defined as follows:

jdbc:<subprotocol>:<subname>

The “subprotocol” is simply a driver name, while the the format of the “subname can vary, depending on the subprotocol, and it can have any internal syntax the driver writer chooses, including a subsubname.” In other words, it can be anything at all. Not very satisfying, or particularly “standard.”2

In poking around the net, however, I found a fair number of database URI formats defined by various projects:

All very similar, right? Most database engines support all or a subset of these connection parts in common:

  • username
  • password
  • host address
  • port
  • database name
  • configuration parameters

So why not define a standard database URI format with all those parts, and use them where appropriate for each engine? It’s all right there, just like http URLs.

The Proposal

Here’s my proposal. Formally, it’s an opaque URI like JDBC. All database URIs start with the scheme db:. But in this case, the opaque part is an embedded URI that may be in one of two formats:

engine://[username[:password]@]host[:port][/dbname][?params]
engine:[dbname][?params]

In other words, a pretty typical http- or mailto-style URI format. We embed it in a db: URI in order to identify the URI as a database URI, and to have a single reasonable scheme to register. Informally, it’s simplest to think of a database URI as a single URI starting with the combination of the scheme and the engine, e.g., db:mysql.

Some notes on the formats:

  • The Database URI scheme is db. Consequently, database URIs always start with db:. This is the URI scheme that defines a database URI.

  • Next comes the database engine. This part is a string naming the type of database engine for the database. It must always be followed by a colon, :. There is no formal list of supported engines, though certain implementations may specify engine-specific semantics, such as a default port.

  • The authority part is separated from the engine by a double slash, //, and terminated by the next slash or end of the URI. It consists of an optional user-information part, terminated by @ (e.g., username:password@); a required host address (e.g., domain name or IP address); and an optional port number, preceded by a colon, :.

  • The path part specifies the database name or path. For URIs that contain an authority part, a path specifying a file name must be absolute. URIs without an authority may use absolute or relative paths.

  • The optional query part, separated by a question mark, ?, contains key=value pairs separated by a semicolon, ;, or ampersand, &. These parameters may be used to configure a database connection with parameters not directly supported by the rest of the URI format.

Examples

Here are some database URIs without an authority part, which is typical for non-server engines such as SQLite, where the path part is a relative or absolute file name:

  • db:sqlite:
  • db:sqlite:foo.db
  • db:sqlite:../foo.db
  • db:sqlite:/var/db/foo.sqlite

Other engines may use a database name rather than a file name:

  • db:ingres:mydb
  • db:postgresql:template1

When a URI includes an authority part, it must be preceded by a double slash:

  • db:postgresql://example.com
  • db:mysql://root@localhost
  • db:pg://postgres:secr3t@example.net

To add the database name, separate it from the authority by a single slash:

  • db:postgresql://example.com/template1
  • db:mongodb://localhost:27017/myDatabase
  • db:oracle://scott:tiger@foo.com/scott

Some databases, such as Firebird, take both a host name and a file path. These paths must be absolute:

  • db:firebird://localhost/tmp/test.gdb
  • db:firebird://localhost/C:/temp/test.gdb

Any URI format may optionally have a query part containing key/value pairs:

  • db:sqlite:foo.db?foreign_keys=ON;journal_mode=WAL
  • db:pg://localhost:5433/postgres?client_encoding=utf8;connect_timeout=10

Issues

In discussing this proposal with various folks, I’ve become aware of a few challenges to standardization.

First, the requirement that the authority part must include a host address prevents the specification of a URI with a username that can be used to connect to a Unix socket. PostgreSQL and MySQL, among others provide authenticated socket connections. While RFC 3986 requires the host name, its predecessor, RFC 2396, does not. Furthermore, as a precedent, neither do file URIs. So I’m thinking of allowing something like this to connect to a PostgreSQL database

db:pg://postgres:secr3t@/

In short, it makes sense to allow the user information without a host name.

The second issue is the disallowing of relative file names in the path part following an authority part. The problem here is that most database engines don’t use paths for database names, so a leading slash makes no sense. For example, in db:pg:localhost/foo, the PostgreSQL database name is foo, not /foo. Yet in db:firebird:localhost/foo, the Firebird database name is a path, /foo. So each engine implementation must know whether or not the path part is a file name.

But some databases may in fact allow a path to be specified for a local connection, and a name for a remote connection. Informix appears to support such variation. So how is one to know whether the path is a file path or a named database? The two variants cannot be distinguished.

RFC 2396 is quite explicit that the path part must be absolute when following an authority part. But RFC 3986 forbids the double slash only when there is no authority part. Therefore, I think it might be best to require a second slash for absolute paths. Engines that use a simple name or relative path can have it just after the slash, while an absolute path could use a second slash:

  • Absolute: db:firebird://localhost//tmp/test.gdb
  • Relative: db:firebird://localhost/db/test.gdb
  • Name: db:postgresql://localhost/template1

That’s It

The path issue aside, I feel like this is a pretty simple proposal, and could have wide utility. I’ve already knocked out a Perl reference implementation, URI::db. Given the wide availability of URI parsers in various programming languages, I wouldn’t expect it to be difficult to port, either.

The uri-db project is the canonical home for the proposal for now, so check there for updates. And your feedback would be appreciated! What other issues have I overlooked? What have I got wrong? Let me know!


  1. As in not surprised at all. Though I was hoping!
  2. DSNs for Perl’s DBI aren’t much better: dbi:<driver>:<driver-specific-stuff>.
More about…

Indexing Nested hstore

In my first Nested hstore post yesterday, I ran a query against unindexed hstore data, which required a table scan. But hstore is able to take advantage of GIN indexes. So let’s see what that looks like. Connecting to the same database, I indexed the review column:

reviews=# CREATE INDEX idx_reviews_gin ON reviews USING GIN(review);
CREATE INDEX
Time: 360448.426 ms
reviews=# SELECT pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty 
----------------
 421 MB

Well, that takes a while, and makes the database a lot bigger (it was 277 MB unindexed). But is it worth it? Let’s find out. Oleg and Teodor’s patch adds support for a nested hstore value on the right-hand-side of the @> operator. In practice, that means we can specify the full path to a nested value as an hstore expression. In our case, to query only for Books, instead of using this expression:

WHERE review #> '{product,group}' = 'Book'

We can use an hstore value with the entire path, including the value:

WHERE review @> '{product => {group => Book}}'

Awesome, right? Let’s give it a try:

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => Book}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 849.681 ms

That time looks better than yesterday’s, but in truth I first ran this query just before building the GIN index and got about the same result. Must be that Mavericks is finished indexing my disk or something. At any rate, the index is not buying us much here.

But hey, we’re dealing with 1998 Amazon reviews, so querying against books probably isn’t very selective. I don’t blame the planner for deciding that a table scan is cheaper than an index scan. But what if we try a more selective value, say “DVD”?

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => DVD}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count 
---------------------+----------------+-------
                   1 |           4.27 |  2646
                   2 |           4.44 |  4180
                   3 |           4.53 |  1996
                   4 |           4.38 |  2294
                   5 |           4.48 |   943
                   6 |           4.42 |   738
(6 rows)

Time: 73.913 ms

Wow! Under 100ms. That’s more like it! Inverted indexing FTW!

More about…

Testing Nested hstore

I’ve been helping Oleg Bartunov and Teodor Sigaev with documentation for the forthcoming nested hstore patch for PostgreSQL. It adds support for arrays, numeric and boolean types, and of course arbitrarily nested data structures. This gives it feature parity with JSON, but unlike the JSON type, its values are stored in a binary representation, which makes it much more efficient to query. The support for GiST and GIN indexes to speed up path searches doesn’t hurt, either.

As part of the documentation, we wanted to include a short tutorial, something to show off the schemaless flexibility of the new hstore. The CitusDB guys were kind enough to show off their json_fdw with some Amazon review data in a blog post a few months back; it even includes an interesting query against the data. Let’s see what we can do with it. First, load it:

> createdb reviews
> psql -d reviews -c '
    CREATE EXTENSION HSTORE;
    CREATE TABLE reviews(review hstore);
'
CREATE TABLE
> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
 | sed -e "s/'/''/g" | sed -e 's/":/" =>/g' > /tmp/hstore.copy
> time psql -d reviews -c "COPY reviews FROM '/tmp/hstore.copy'"
COPY 589859
       0.00s user 0.00s system 0% cpu 13.059 total

13 seconds to load 589,859 records from a file – a little over 45k records per second. Not bad. Let’s see what the storage looks like:

> psql -d reviews -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
 pg_size_pretty 
----------------
 277 MB

The original, uncompressed data is 208 MB on disk, so roughly a third bigger given the overhead of the database. Just for fun, let’s compare it to JSON:

> createdb reviews_js
> psql -d reviews_js -c 'CREATE TABLE reviews(review json);'
CREATE TABLE
> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
 | sed -e "s/'/''/g" | > /tmp/json.copy
> time psql -d reviews_js -c "COPY reviews FROM '/tmp/json.copy'"
COPY 589859
       0.00s user 0.00s system 0% cpu 7.434 total
> psql -d reviews_js -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
 pg_size_pretty 
----------------
 239 MB

Almost 80K records per second, faster, I’m guessing, because the JSON type doesn’t convert the data to binary representation its way in. JSON currently uses less overhead for storage, aw well; I wonder if that’s the benefit of TOAST storage?

Let’s try querying these guys. I adapted the query from the CitusDB blog post and ran it on my 2013 MacBook Air (1.7 GHz Intel Core i7) with iTunes and a bunch of other apps running in the background [yeah, I’m lazy]). Check out those operators, by the way! Given a path, #^> returns a numeric value:

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review #> '{product,group}' = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 2301.620 ms

The benefit of the native type is pretty apparent here. I ran this query several times, and the time was always between 2.3 and 2.4 seconds. The Citus json_fdw query took “about 6 seconds on a 3.1 GHz CPU core.” Let’s see how well the JSON type does (pity there is no operator to fetch a value as numeric; we have to cast from text):

reviews_js=# SELECT
    width_bucket(length(review #>> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg((review #>> '{review,rating}')::numeric), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review #>> '{product,group}' = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 5530.120 ms

A little faster than the json_fdw version, but comparable. But takes well over twice as long as the hstore version, though. For queries, hstore is the clear winner. Yes, you pay up-front for loading and storage, but the payoff at query time is substantial. Ideally, of course, we would have the insert and storage benefits of JSON and the query performance of hstore. There was talk last spring at PGCon of using the same representation for JSON and hstore; perhaps that can still come about.

Meanwhile, I expect to play with some other data sets over the next week; watch this spot for more!

More about…

The Women at XOXO Rocked It

XOXO was fantastic again this year. Since it wrapped up a week ago, there have been a number of fantastic posts from the likes of Frank Chimero, Maciej Ceglowski, and Kelly Kend; you should go read them. I was particularly moved by the unabashed vulnerability and honesty of the speakers, and their willingness to recruit the audience to confront the debilitating effects of impostor syndrome. I expect to write more about the festival in the future. Or maybe I’ll just create something awesome in the coming year, instead.

One thing struck me after the first day of talks that I’ve not noticed addressed elsewhere. While women were under-represented in the audience (though far better than at most tech conferences), but they were pretty well-represented among the speakers. And the women speakers — oh man. They. Kicked. Ass.

Erika Moen openly shared her quest for sexual identity via comics. Vi Hart hilariously described how to make money on YouTube — and Andy told her she could keep going for as long as she wanted. Molly Crabapple raised important yet seldom discussed issues around the independence of artists and the availability of capital on the internet. Julie Uhrman humbly shared all the ways in which Ouya has failed, and in doing so making it better. And Christina Xu discussed [BreadPig]’s objective to enable artists to make a living online without exploitation.

These amazing people weren’t at XOXO because they’re women; they weren’t there to represent women as a separate entity, or to talk about the under-representation of women in tech. No, they were there because they’ve created incredible things and wanted to share. Their energy was palpable. They just came out on stage and totally fucking rocked it.

This is how it ought be. You make something. You’re excited about it. Your energy infects the audience. And your gender and ethnicity have nothing to do with it. It’s moving simply to be amazing.

Alas, XOXO is the outlier here. But it points to the future, and I’m excited to help push it forward.

This post originally appeared on Svbtle.

The Power of Enums

Jim Mlodgenski on using Enums in place of references to small lookup tables:

I saw something else I didn’t expect: […] There was a 8% increase in performance. I was expecting the test with the enums to be close to the baseline, but I wasn’t expecting it to be faster. Thinking about it, it makes sense. Enums values are just numbers so we’re effectively using surrogate keys under the covers, but the users would still the the enum labels when they are looking at the data. It ended up being a no brainer to use enums for these static tables. There was a increase in performance while still maintaining the integrity of the data.

I’ve been a big fan of Enums since Andrew and Tom Dunstan released a patch for them during the PostgreSQL 8.2 era. Today they’re a core feature, and as of 9.1, you can even modify their values! You’re missing out if you’re not using them yet.

More about…

Whither Impala Fault Tolerance?

Justin Erickson on the Cloudera Blog

In December 2012, while Cloudera Impala was still in its beta phase, we provided a roadmap for planned functionality in the production release. In the same spirit of keeping Impala users, customers, and enthusiasts well informed, this post provides an updated roadmap for upcoming releases later this year and in early 2014.

Impala is a pretty nice-looking SQLish query engine that runs on Hadoop. It provides the same basic interface as Hive, but circumvents MapReduce to access data directly from data nodes in parallel. But, in my opinion, to be useful as a real-time query engine, it needs fault tolerance. From the Impala FAQ, under the list of unsupported features:

Fault tolerance for running queries (not currently). In the current release, Impala aborts the query if any host on which the query is executing fails. In the future, we will consider adding fault tolerance to Impala, so that a running query would complete even in the presence of host failures.

Sounds like an unfortunate issue. Since it’s pretty typical for data nodes to go down, this seems like an essential feature. Products like CitusDB offer fault tolerance:

Does CitusDB recover from failures?

Yes. The CitusDB master node intelligently re-routes the work on any failed nodes to the remaining nodes in real-time. Since the underlying data are kept in fixed-size blocks in HDFS, a failed node’s work can evenly be distributed among the remaining nodes in the cluster.

That sound exactly right. I’m excited about Citus, and if it adds solid support for more data formats, such as ORC and Parquet, it may well be the way to go. But Impala will be a nice alternative if it can get fault tolerance figured out. I’m disappointed it’s not on the road map.

Lexical Subroutines

Ricardo Signes:

One of the big new experimental features in Perl 5.18.0 is lexical subroutines. In other words, you can write this:

my sub quickly { ... }
my @sorted = sort quickly @list;

my sub greppy (&@) { ... }
my @grepped = greppy { ... } @input;

These two examples show cases where lexical references to anonymous subroutines would not have worked. The first argument to sort must be a block or a subroutine name, which leads to awful code like this:

sort { $subref->($a, $b) } @list

With our greppy, above, we get to benefit from the parser-affecting behaviors of subroutine prototypes.

My favorite tidbit about this feature? Because lexical subs are lexical, and method-dispatch is package-based, lexical subs are not subject to method lookup and dispatch! This just might alleviate the confusion of methods and subs, as chromatic complained about just yesterday. Probably doesn’t solve the problem for imported subs, though.

More about…