pgTAP Set-Testing Update

I've been thinking more about testing SQL result sets and how to name functions that do such testing, and I've started to come to some conclusions. Some of the constraints I'm looking at:

  • Cursors are required for tests where the order of the results returned is important. It might be best for such functions to create the cursors themselves.

  • For comparisons where order isn't important, the results of each SQL statement must be inserted into a temporary table and then the table used for the comparisons. Otherwise, each statement would be executed twice, as is required to calculate symmetric difference. By executing each once and storing the results in a temporary table, we get around this issue (and indeed, this one of the cases that Epic's global() function addresses).

  • When order is not important, the most efficient way to compare result sets is with symmetric difference. However, said comparison is a set comparison, meaning that duplicate rows are ignored. So if set A has 3 rows and set B has four, but two of those four are identical, then sets A and B can still be equivalent.

I'm starting to think that I would have two basic result set testing functions, set_eq() and bag_eq(). The former would do a set comparison, while the latter would require that duplicate rows be present in both result sets. Unfortunately, that would mean that it would be difficult for set_eq to have a variation that tests ordered sets, as symmetric difference ignores relational ordering. And a simple bag_eq() function would require that the relations be ordered, as it would iterate over each row in each relation in turn and compare row to row. But as I pointed out to commenter “@dave0,” bags are not inherently ordered, so it would be imposing a requirement that's not necessarily appropriate.

This is starting to drive me a bit nuts.

I think that there are ways to enforce an ordered comparison on a set and an unordered comparison on a bag, but both would be pretty inefficient. Maybe I should do it anyway, include the appropriate caveats in the documentation, and then improve when feasible in the future. In that case, what I'd be looking at is something like this:

set_eq( sql, sql )
Test for set equivalence of two SQL statements.
oset_eq( sql, sql )
Test for ordered set equivalence of two SQL statements.
bag_eq( sql, sql )
Test for row equivalence of two SQL statements.
obag_eq( sql, sql )
Test for ordered row equivalence of two SQL statements.

The preferred tests would be set_eq() and obag_eq(). If a single word is passed to any of these functions, it's assumed to be a prepared statement. Cursors would be created internally for the functions that require ordered comparison. The non-ordered versions would create temporary tables to hold the values and then use those tables for the comparisons. bag_eq() would also construct cursors on the temporary tables to ensure that rows could be compared in the same order in which they were generated by the SQL statement.

Interface-wise, perhaps a boolean would be preferred to indicate whether or not to compare the rows in an ordered fashion? That would be:

set_eq( sql, sql, bool )
Test for set equivalence of two SQL statements. The sets must be in the same order if the boolean argument is true.
bag_eq( sql, sql, bool )
Test for row equivalence of two SQL statements. The bags must have their rows in the same order if the boolean argument is true.

I like that there are fewer functions this way, but is it harder to remember what the boolean is for? (It would not be required, and would default to false for both functions). Thoughts?

By the way, I would likely throw in a couple of other resultset-comparing functions:

set_includes( sql, sql )
Test that the set returned by the first statement includes the rows returned by the second statement.
set_excludes( sql, sql )
Test that the set returned by the first statement excludes the rows returned by the second statement.
bag_includes( sql, sql )
Test that the bag returned by the first statement includes the rows returned by the second statement, including duplicates.
bag_excludes( sql, sql )
Test that the bag returned by the first statement excludes the rows returned by the second statement, including duplicates.

Seem useful? Please leave a comment with your thoughts.

Backtalk

Tom Davis wrote:

Instead of a boolean, the function could expect a string case insensitively matching 'ordered' or 'unordered'. It seems unlikely that the extra overhead of doing that test will have a significant impact on resource usage or run-time while providing a much clearer guide to users and maintainers.

Likewise having the third argument be required will also provide better guidance as users will no doubt have difficulty remembering what the default is. And while typing , 'unordered' is certainly more time consuming than not typing it but having to look up the function.

If for some reason you do need to use a boolean, having the boolean describe ORDERED might be better as some people seem to have difficulty with the double negative aspect of UNORDERED = FALSE

Of course all of this is making me think about the fact that that third argument is really check order and the difference between bag\_eq() and set\_eq() is really disregard duplicates means that you might have a single function with two required arguments and optionally a list of additional attributes sql_eq( sql, sql [, 'compare order'] [, 'disregard duplicates'] ) or some such.

Xaprb wrote:

A similar problem

I am working on a somewhat related problem. You can read all about it here: http://code.google.com/p/maatkit/wiki/mk_upgrade

This is for MySQL, of course -- and it's not a test suite per se, so there isn't 100% overlap. But does this give ideas or add color to anything you're working on? On the other hand, if you have suggestions, I welcome them very warmly.

Theory wrote:

Replies

@Tom—

Thanks, I'll consider using "ordered" or "unordered." That may indeed be the way to go, as it's nice and clear. I'll have to think about whether I'd want it to be required, though. And yeah, maybe another argument for ignoring duplicates is best, as well. Hrm…stuff to ponder.

@Xaprb—I like the idea of CHECKSUMing a table in MySQL, although if you want decent diagnostics you'd still have to iterate over all the data on failure anyway.

BTW, I got an inquiry as to whether there was anything like pgTAP for MySQL. There isn't to the best of my knowledge; have you heard of anything? If not, any interest in implementing myTAP for such folks?

—Theory

Joseph Brenner wrote:

can ordered/unordered also apply to includes?

An odd thought: couldn't you also use the ordered/unordered parameter with set_includes and bag_includes? The ordered comparison would allow intervening rows, but insist that the rows of the second result be found in the given sequence.

Theory wrote:

Re: can ordered/unordered also apply to includes?

@Joseph—

Yes, I had the same thought, and I would likely work that in as well, though maybe not at first. Gotta see how difficult this stuff is going to be.

—Theory