Just a Theory

Black lives matter

Indexing Nested hstore

In my first Nested hstore post yesterday, I ran a query against unindexed hstore data, which required a table scan. But hstore is able to take advantage of GIN indexes. So let’s see what that looks like. Connecting to the same database, I indexed the review column:

reviews=# CREATE INDEX idx_reviews_gin ON reviews USING GIN(review);
CREATE INDEX
Time: 360448.426 ms
reviews=# SELECT pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty 
----------------
 421 MB

Well, that takes a while, and makes the database a lot bigger (it was 277 MB unindexed). But is it worth it? Let’s find out. Oleg and Teodor’s patch adds support for a nested hstore value on the right-hand-side of the @> operator. In practice, that means we can specify the full path to a nested value as an hstore expression. In our case, to query only for Books, instead of using this expression:

WHERE review #> '{product,group}' = 'Book'

We can use an hstore value with the entire path, including the value:

WHERE review @> '{product => {group => Book}}'

Awesome, right? Let’s give it a try:

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => Book}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 849.681 ms

That time looks better than yesterday’s, but in truth I first ran this query just before building the GIN index and got about the same result. Must be that Mavericks is finished indexing my disk or something. At any rate, the index is not buying us much here.

But hey, we’re dealing with 1998 Amazon reviews, so querying against books probably isn’t very selective. I don’t blame the planner for deciding that a table scan is cheaper than an index scan. But what if we try a more selective value, say “DVD”?

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => DVD}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count 
---------------------+----------------+-------
                   1 |           4.27 |  2646
                   2 |           4.44 |  4180
                   3 |           4.53 |  1996
                   4 |           4.38 |  2294
                   5 |           4.48 |   943
                   6 |           4.42 |   738
(6 rows)

Time: 73.913 ms

Wow! Under 100ms. That’s more like it! Inverted indexing FTW!