Just a Theory

By David E. Wheeler

Posts about Relational Algebra

pgTAP Set-Testing Update

I’ve been thinking more about testing SQL result sets and how to name functions that do such testing, and I’ve started to come to some conclusions. Some of the constraints I’m looking at:

  • Cursors are required for tests where the order of the results returned is important. It might be best for such functions to create the cursors themselves.

  • For comparisons where order isn’t important, the results of each SQL statement must be inserted into a temporary table and then the table used for the comparisons. Otherwise, each statement would be executed twice, as is required to calculate symmetric difference. By executing each once and storing the results in a temporary table, we get around this issue (and indeed, this one of the cases that Epic’s global() function addresses).

  • When order is not important, the most efficient way to compare result sets is with symmetric difference. However, said comparison is a set comparison, meaning that duplicate rows are ignored. So if set A has 3 rows and set B has four, but two of those four are identical, then sets A and B can still be equivalent.

I’m starting to think that I would have two basic result set testing functions, set_eq() and bag_eq(). The former would do a set comparison, while the latter would require that duplicate rows be present in both result sets. Unfortunately, that would mean that it would be difficult for set_eq to have a variation that tests ordered sets, as symmetric difference ignores relational ordering. And a simple bag_eq() function would require that the relations be ordered, as it would iterate over each row in each relation in turn and compare row to row. But as I pointed out to commenter “@dave0,” bags are not inherently ordered, so it would be imposing a requirement that’s not necessarily appropriate.

This is starting to drive me a bit nuts.

I think that there are ways to enforce an ordered comparison on a set and an unordered comparison on a bag, but both would be pretty inefficient. Maybe I should do it anyway, include the appropriate caveats in the documentation, and then improve when feasible in the future. In that case, what I’d be looking at is something like this:

set_eq( sql, sql )
Test for set equivalence of two SQL statements.
oset_eq( sql, sql )
Test for ordered set equivalence of two SQL statements.
bag_eq( sql, sql )
Test for row equivalence of two SQL statements.
obag_eq( sql, sql )
Test for ordered row equivalence of two SQL statements.

The preferred tests would be set_eq() and obag_eq(). If a single word is passed to any of these functions, it’s assumed to be a prepared statement. Cursors would be created internally for the functions that require ordered comparison. The non-ordered versions would create temporary tables to hold the values and then use those tables for the comparisons. bag_eq() would also construct cursors on the temporary tables to ensure that rows could be compared in the same order in which they were generated by the SQL statement.

Interface-wise, perhaps a boolean would be preferred to indicate whether or not to compare the rows in an ordered fashion? That would be:

set_eq( sql, sql, bool )
Test for set equivalence of two SQL statements. The sets must be in the same order if the boolean argument is true.
bag_eq( sql, sql, bool )
Test for row equivalence of two SQL statements. The bags must have their rows in the same order if the boolean argument is true.

I like that there are fewer functions this way, but is it harder to remember what the boolean is for? (It would not be required, and would default to false for both functions). Thoughts?

By the way, I would likely throw in a couple of other resultset-comparing functions:

set_includes( sql, sql )
Test that the set returned by the first statement includes the rows returned by the second statement.
set_excludes( sql, sql )
Test that the set returned by the first statement excludes the rows returned by the second statement.
bag_includes( sql, sql )
Test that the bag returned by the first statement includes the rows returned by the second statement, including duplicates.
bag_excludes( sql, sql )
Test that the bag returned by the first statement excludes the rows returned by the second statement, including duplicates.

Seem useful? Please leave a comment with your thoughts.

Looking for the comments? Try the old layout.