Enforcing a Set of Values

Enumerate Me

I love enums. They’re a terrific way to quickly create self-documenting data types that represent a set of values, and the nice thing is that the underlying values are stored as integers, making them very space- and performance-efficient. A typical example might be a workflow approval process for publishing magazine articles. You create it like so:

CREATE TYPE article_states AS ENUM (
    'draft', 'copy', 'approved', 'published'
);

Nice: we now have a simple data type that’s self-documenting. An an important feature of enums is that the ordering of values is the same as the declared labels. For a workflow such as this, it makes a lot of sense, because the workflow states are inherently ordered: “draft” comes before “copy” and so on.

Unfortunately, enums aren’t a panacea. I would use them all over the place if I could, but, alas, the value-set data types I tend to need tend not to have inherently ordered values other than the collation order of the text. For example, say that we need a table describing people’s faces. Using an enum to manage eye colors might look something like this:

CREATE TYPE eye_color AS ENUM ( 'blue', 'green', 'brown' );

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color eye_color NOT NULL
);

Nice, huh? So let’s insert a few values and see what it looks like:

INSERT INTO faces (name, eye_color)
VALUES ('David', 'blue' ),
       ('Julie', 'green' ),
       ('Anna', 'blue' ),
       ('Noriko', 'brown' )
;

So let’s look at the data ordered by the enum:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Julie  | green
 Noriko | brown

Hrm. That’s not good. I forgot to put “green” after “brown” when I created the enum. Oh, and I forgot the color “hazel”:

% INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );
ERROR:  invalid input value for enum eye_color: "hazel"

Well, nice to know that it’s enforced, and that message is really helpful. But the real problem is that we run into the inherent ordering of enum labels, and now we need to adjust the enum to meet our needs. Here’s how to do it:

ALTER TABLE faces RENAME eye_color TO eye_color_tmp;
ALTER TABLE faces ALTER eye_color_tmp TYPE TEXT;
DROP TYPE eye_color;
CREATE TYPE eye_color AS ENUM ( ‘blue’, ‘brown’, ‘green’, ‘hazel’ );
ALTER TABLE faces ADD eye_color eye_color;
UPDATE faces SET eye_color = eye_color_tmp::eye_color;
ALTER TABLE faces ALTER eye_color SET NOT NULL;
ALTER TABLE faces DROP column eye_color_tmp;

Yikes! I have to rename the column, change its type to TEXT, drop the enum, create a new enum, and then copy all of the data into the new column before finally dropping the old column. If I have a lot of data, this will not be very efficient, requiring that every single row be rewritten. Still, it does work:

% INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );
% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
--------+-----------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green
 Kat    | hazel

The upshot is that enums are terrific if you have a very well-defined set of values that are inherently ordered (or where order is not important) and that are extremely unlikely to change. Perhaps someday PostgreSQL will have a more robust ALTER TYPE that allows enums to be more efficiently reorganized, but even then it seems likely that re-ordering values will require a table rewrite.

Lookup to Me

Another approach to handling a type as a set of values is to take advantage of the relational model and create store the values in a table. Going with the faces example, it looks like this:

CREATE TABLE eye_colors (
    eye_color TEXT PRIMARY KEY
);

INSERT INTO  eye_colors VALUES( 'blue' ), ('green'), ('brown' );

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color TEXT REFERENCES eye_colors(eye_color)
);

We can use this table much as we did before:

INSERT INTO faces (name, eye_color)
VALUES ('David', 'blue' ),
       ('Julie', 'green' ),
       ('Anna', 'blue' ),
       ('Noriko', 'brown' )
;

And of course we can get the rows back properly ordered by eye_color, unlike the original enum example:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
----+------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green

Cool! But there are a couple of downsides. One is that you’re adding a bit of I/O overhead to every update. Most likely you won’t have very many values in the eye_colors table, so given PostgreSQL’s caching, this isn’t a big deal. A bigger deal is error handling:

INSERT INTO eye_colors VALUES ('hazel');
ERROR:  insert or update on table "faces" violates foreign key constraint "faces_eye_color_fkey"

That’s not an incredibly useful error message. One might ask, without knowing the schema, what has an eye color has to do with a foreign key constraint? At least looking at the tables can tell you a bit more:

% \dt
          List of relations
 Schema |    Name    | Type  | Owner 
----+------+----+----
 public | eye_colors | table | david
 public | faces      | table | david

A quick look at the eye_colors table will tell you what’s going on, and you can figure out that you just need to add a new row:

INSERT INTO eye_colors VALUES ('hazel');
INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );

So it is self-documenting, but unlike enums it doesn’t do a great job of it. Plus if you have a bunch of set-constrained value types, you can end up with a whole slew of lookup tables. This can make it harder to sort the important tables that contain actual business data from those that are just lookup tables, because there is nothing inherent in them to tell the difference. You could put them into a separate schema, of course, but still, it’s not exactly intuitive.

Given these downsides, I’m not a big fan of using lookup tables for managing what is in fact a simple list of allowed values for a particular column unless those values change frequently. So what else can we do?

Constrain Me

A third approach is to use a table constraint, like so:

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color TEXT NOT NULL,
    CONSTRAINT valid_eye_colors CHECK (
        eye_color IN ( 'blue', 'green', 'brown' )
    )
);

No lookup table, no inherent ENUM ordering. And in regular usage it works just like the lookup table example. The usual INSERT and SELECT once again yields:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
----+------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green

The error message, however, is a bit more helpful:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
ERROR:  new row for relation "faces" violates check constraint "valid_eye_colors"

A check constraint violation on eye_color is much more informative than a foreign key constraint violation. The downside to a check constraint, however, is that it’s not as self-documenting. You have to look at the entire table in order to find the constraint:

% \d faces
                             Table "public.faces"
  Column   |  Type   |                        Modifiers                        
------+-----+-----------------------------
 face_id   | integer | not null default nextval('faces_face_id_seq'::regclass)
 name      | text    | not null default ''::text
 eye_color | text    | not null
Indexes:
    "faces_pkey" PRIMARY KEY, btree (face_id)
Check constraints:
    "valid_eye_colors" CHECK (eye_color = ANY (ARRAY['blue', 'green', 'brown']))

There it is at the bottom. Kind of tucked away there, eh? At least now we can change it. Here’s how:

ALTER TABLE faces DROP CONSTRAINT valid_eye_colors;
ALTER TABLE faces ADD CONSTRAINT valid_eye_colors CHECK (
    eye_color IN ( 'blue', 'green', 'brown', 'hazel' )
);

Not as straight-forward as updating the lookup table, and much less efficient (because PostgreSQL must validate that existing rows don’t violate the constraint before committing the constraint). But it’s pretty simple and at least doesn’t require the entire table be UPDATEd as with enums. For occasional changes to the value list, a table scan is not a bad tradeoff. And of course, once that’s done, it just works:

INSERT INTO eye_colors VALUES ('hazel');
INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );

So this is almost perfect for our needs. Only poor documentation persists as an issue.

This is My Domain

To solve that problem, switch to domains. A domain is simply a custom data type that inherits behavior from another data type and to which one or more constraints can be added. It’s pretty simple to switch from the table constraint to a domain:

CREATE DOMAIN eye_color AS TEXT
CONSTRAINT valid_eye_colors CHECK (
    VALUE IN ( 'blue', 'green', 'brown' )
);

CREATE TABLE faces (
    face_id SERIAL PRIMARY KEY,
    name TEXT NOT NULL DEFAULT '',
    eye_color eye_color NOT NULL
);

Nice table declaration, eh? Very clean. Looks exactly like the enum example, in fact. And it works as well as the table constraint:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
----+------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green

A constraint violation is a bit more useful than with the table constraint:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
ERROR:  value for domain eye_color violates check constraint "valid_eye_colors"

This points directly to the domain. It’d be nice if it mentioned the violating value the way the enum error did, but at least we can look at the domain out like so:

\dD eye_color
                                                        List of domains
 Schema |   Name    | Type | Modifier |                                         Check                                          
----+------+---+-----+--------------------------------------------
 public | eye_color | text |          | CHECK (VALUE = ANY (ARRAY['blue', 'green', 'brown', 'hazel']))

None of the superfluous stuff about the entire table to deal with, just the constraint, thank you very much. Changing it is just as easy as changing the table constraint:

ALTER DOMAIN eye_color DROP CONSTRAINT valid_eye_colors;
ALTER DOMAIN eye_color ADD CONSTRAINT valid_eye_colors CHECK (
    VALUE IN ( 'blue', 'green', 'brown', 'hazel' )
);

Yep, you can alter domains just as you can alter tables. And of course now it will work:

INSERT INTO eye_colors VALUES ('hazel');
INSERT INTO faces (name, eye_color) VALUES ('Kat', 'hazel' );

And as usual the data is well-ordered when we need it to be:

% SELECT name, eye_color FROM faces ORDER BY eye_color;
  name  | eye_color 
----+------
 David  | blue
 Anna   | blue
 Noriko | brown
 Julie  | green
 Kat    | hazel

And as an added bonus, if you happened to need an eye color in another table, you can just use the same domain and get all the proper semantics. Sweet!

Color Me Happy

Someday I’d love to see support for a PostgreSQL feature like enums, but allowing an arbitrary list of strings that are ordered by the contents of the text rather than the order in which they were declared, and that’s efficient to update. Maybe it could use integers for the underlying storage, too, and allow values to be modified without a table rewrite. Such would be the ideal for this use case. Hell, I’d find it much more useful than enums.

But domains get us pretty close to that without too much effort, so maybe it’s not that important. I’ve tried all of the above approaches and discussed it quite a lot with my colleagues before settling on domains, and I’m quite pleased with it. The only caveat I’d have is that it’s not to be used lightly. If the value set is likely to change fairly often (at least once a week, say), then you’d be better off with the lookup table.

In short, I recommend:

  • For an inherently ordered set of values that’s extremely unlikely to ever change, use an enum.
  • For a set of values that won’t often change and has no inherent ordering, use a domain.
  • For a set of values that changes often, use a lookup table.

What do you use to constrain a column to a defined set of unordered values?

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Comments & Trackbacks

Dave Rolsky wrote:

I think the decision between domain & lookup table is based on more than just the frequency of change.

It also depends on who is changing it and why. If you want to allow privileged (but non-geek) users to change it as they see fit, you need a lookup table (and an admin table).

OTOH, if you only change it as part of changing the app itself, and it's okay to only change it as part of a standard deployment process (which can include DDL changes), then a domain will work well.

But I could imagine the lookup table still being necessary even if changes are very infrequent. This is especially true if you're building an app that other people will install and use. You want to encourage customization through configuration & data changes, rather than code changes (effectively requiring a fork for each customization).

Dave Rolsky wrote:

Doh, wish I could edit my comments. I meant "an admin interface", not "an admin table".

Simon wrote:

Some remarks

First of all dropping and creating a constraint on a table should be made in a transaction (just to be sure that there won't be any wrong insert between the two queries).

Such changing a domain constraint is also not so efficient (even worse than altering the table constraint) as all tables, where the domain is used, have to be checked if there is a record that contains a wrong string in the domain column. I haven't checked that but this should lead to some heavy locks (including deadlocks) if the domain is used in many tables.

Using the domain could also lead to much greater table size as the strings can be quite long. Any enum type occupies just four bytes on disk. Of course that all depends on the database and data and this can be unnoticeable.

Jeff Davis wrote:

What does it mean to change an enum?

When adding the eye color "hazel" to the existing set of options "blue, green, and brown", is it really safe to just add the new option?

For instance, prior records may have used "green" instead of hazel because the "hazel" option didn't exist. So changing the available options might affect the meaning of all previous records.

Andrew Hammond wrote:

Lookup tables are also handy when you're trying to write a UI and need a list of acceptable options. I imagine you could generate such a list from a domain or enum, too, but it'd probably take a little more effort. I've generally gone with lookup tables because I often want to add extra meta-data to options, like say a default eye-color, and if you've already got a table, adding another column is pretty easy.

Bill Karwin wrote:

Andrew beat me to it

I was going to comment on additional drawbacks of using ENUM or DOMAIN or TYPE or CHECK constraints. Andrew got two of the three:

  1. Querying the current list of allowed values is easy with a lookup table, but requires intricate queries against system views with the other solutions.

  2. A lookup table makes it easy to define attributes for each value. This is not possible with the other solutions.

And the third:

  1. Changing a list of permitted values requires knowing the previous state of the list. Easy enough to eyeball it if you're making the change ad hoc, but if you need to do it programmatically, see the #1 weakness above. In a lookup table, you can INSERT, DELETE, or UPDATE values individually without touching the rest of the list.

Scott Bailey wrote:

I've never actually used enums in production. But it is not too difficult to add a value to an enum by adding an entry to the pg_enum table. Removing or reordering values is another story. And it is easy enough to get the list of values for an enum by using enum_range().

Xaprb wrote:

I don't get why a SELECT causes a foreign key error about INSERT or UPDATE violations. Did I miss something?

Theory wrote:

Replies

@Dave—

Yeah, I was assuming that it was purely to constrain values, not that the values would be modified via an admin interface. For that the lookup table is clearly the best idea. I'd still probably want to put it into a separate schema, though.

@Simon—

Yeah, I should get in the habit of putting the transaction stuff in my blog examples. I certainly do that in real life! And of course I agree that changing a domain is less efficient if you have the same domain in multiple tables. But if you had a similar constraint in multiple tables, I expect you'd have to update it in them all anyway, so it's likely a wash. I'm with you on the space issue, which is why I wish we had something like ENUM but that used text collation for ordering and was painless to modify in-place.

@Jeff—Excellent questions, which boil down to the answer: “It's important to always know your data!” But I hope that doesn't detract from the implementation discussion.

@Andew—FYI, you can specify defaults for domains, too, like so:

CREATE DOMAIN eye_color AS TEXT
CONSTRAINT valid_eye_colors DEFAULT 'blue' CHECK (
    VALUE IN ( 'blue', 'green', 'brown' , 'hazel')
);

@Bill—You and andrew are correct. If you need to query for the values in your app, then you're better off with a lookup table or an enum if you can get away with it.

@Scott—Quit right about adding values via the system tables, but direct updating of the system tables is generally frowned-upon. I'l have to keep enum_range() in mind, though, nice find.

@Xaprb—Nice catch! It was supposed to be an INSERT. Fixed, thanks!

—Theory

Josh Berkus wrote:

David,

I think you chose a bad example for Domains. Something like eye color is clearly not tied to any programming changes, and as such should probably be a lookup list table editable by the admin user.

Where domains are a clearer win is for case where the values in the list are tied to programmed behavior, so a code deployment is required to change them anyway, BUT it's conceiveable that the values might be appended in the future. For example, you could have an "invoice_frequency" value, which offers ( 'Monthly', 'Quarterly', 'Annual' ). It's conceiveable that you might someday want to add 'Bimonthly' to that, so you don't want a ENUM, but you also don't want the admin user able to casually add to the list, since the application would break.

So the decision tree is:

List will never change, and ordering is preset or doesn't matter: use an ENUM

List might change but changes require programming, OR list will not change but needs to be alpha sorted: use a DOMAIN

List will change, and changes do not require other programming: use a lookup table.

Discussion is now closed.

Powered by KinoSearch