Just a Theory

Trans rights are human rights

Posts about hstore

Indexing Nested hstore

In my first Nested hstore post yesterday, I ran a query against unindexed hstore data, which required a table scan. But hstore is able to take advantage of GIN indexes. So let’s see what that looks like. Connecting to the same database, I indexed the review column:

reviews=# CREATE INDEX idx_reviews_gin ON reviews USING GIN(review);
CREATE INDEX
Time: 360448.426 ms
reviews=# SELECT pg_size_pretty(pg_database_size(current_database()));
 pg_size_pretty 
----------------
 421 MB

Well, that takes a while, and makes the database a lot bigger (it was 277 MB unindexed). But is it worth it? Let’s find out. Oleg and Teodor’s patch adds support for a nested hstore value on the right-hand-side of the @> operator. In practice, that means we can specify the full path to a nested value as an hstore expression. In our case, to query only for Books, instead of using this expression:

WHERE review #> '{product,group}' = 'Book'

We can use an hstore value with the entire path, including the value:

WHERE review @> '{product => {group => Book}}'

Awesome, right? Let’s give it a try:

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => Book}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 849.681 ms

That time looks better than yesterday’s, but in truth I first ran this query just before building the GIN index and got about the same result. Must be that Mavericks is finished indexing my disk or something. At any rate, the index is not buying us much here.

But hey, we’re dealing with 1998 Amazon reviews, so querying against books probably isn’t very selective. I don’t blame the planner for deciding that a table scan is cheaper than an index scan. But what if we try a more selective value, say “DVD”?

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review @> '{product => {group => DVD}}'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count 
---------------------+----------------+-------
                   1 |           4.27 |  2646
                   2 |           4.44 |  4180
                   3 |           4.53 |  1996
                   4 |           4.38 |  2294
                   5 |           4.48 |   943
                   6 |           4.42 |   738
(6 rows)

Time: 73.913 ms

Wow! Under 100ms. That’s more like it! Inverted indexing FTW!

Testing Nested hstore

I’ve been helping Oleg Bartunov and Teodor Sigaev with documentation for the forthcoming nested hstore patch for PostgreSQL. It adds support for arrays, numeric and boolean types, and of course arbitrarily nested data structures. This gives it feature parity with JSON, but unlike the JSON type, its values are stored in a binary representation, which makes it much more efficient to query. The support for GiST and GIN indexes to speed up path searches doesn’t hurt, either.

As part of the documentation, we wanted to include a short tutorial, something to show off the schemaless flexibility of the new hstore. The CitusDB guys were kind enough to show off their json_fdw with some Amazon review data in a blog post a few months back; it even includes an interesting query against the data. Let’s see what we can do with it. First, load it:

> createdb reviews
> psql -d reviews -c '
    CREATE EXTENSION HSTORE;
    CREATE TABLE reviews(review hstore);
'
CREATE TABLE
> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
 | sed -e "s/'/''/g" | sed -e 's/":/" =>/g' > /tmp/hstore.copy
> time psql -d reviews -c "COPY reviews FROM '/tmp/hstore.copy'"
COPY 589859
       0.00s user 0.00s system 0% cpu 13.059 total

13 seconds to load 589,859 records from a file – a little over 45k records per second. Not bad. Let’s see what the storage looks like:

> psql -d reviews -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
 pg_size_pretty 
----------------
 277 MB

The original, uncompressed data is 208 MB on disk, so roughly a third bigger given the overhead of the database. Just for fun, let’s compare it to JSON:

> createdb reviews_js
> psql -d reviews_js -c 'CREATE TABLE reviews(review json);'
CREATE TABLE
> gzcat customer_reviews_nested_1998.json.gz | sed -e 's/\\/\\\\/g' \
 | sed -e "s/'/''/g" | > /tmp/json.copy
> time psql -d reviews_js -c "COPY reviews FROM '/tmp/json.copy'"
COPY 589859
       0.00s user 0.00s system 0% cpu 7.434 total
> psql -d reviews_js -c 'SELECT pg_size_pretty(pg_database_size(current_database()));'
 pg_size_pretty 
----------------
 239 MB

Almost 80K records per second, faster, I’m guessing, because the JSON type doesn’t convert the data to binary representation its way in. JSON currently uses less overhead for storage, aw well; I wonder if that’s the benefit of TOAST storage?

Let’s try querying these guys. I adapted the query from the CitusDB blog post and ran it on my 2013 MacBook Air (1.7 GHz Intel Core i7) with iTunes and a bunch of other apps running in the background [yeah, I’m lazy]). Check out those operators, by the way! Given a path, #^> returns a numeric value:

reviews=# SELECT
    width_bucket(length(review #> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg(review #^> '{review,rating}'), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review #> '{product,group}' = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 2301.620 ms

The benefit of the native type is pretty apparent here. I ran this query several times, and the time was always between 2.3 and 2.4 seconds. The Citus json_fdw query took “about 6 seconds on a 3.1 GHz CPU core.” Let’s see how well the JSON type does (pity there is no operator to fetch a value as numeric; we have to cast from text):

reviews_js=# SELECT
    width_bucket(length(review #>> '{product,title}'), 1, 50, 5) title_length_bucket,
    round(avg((review #>> '{review,rating}')::numeric), 2) AS review_average,
    count(*)
FROM
    reviews
WHERE
    review #>> '{product,group}' = 'Book'
GROUP BY
    title_length_bucket
ORDER BY
    title_length_bucket;
 title_length_bucket | review_average | count  
---------------------+----------------+--------
                   1 |           4.42 |  56299
                   2 |           4.33 | 170774
                   3 |           4.45 | 104778
                   4 |           4.41 |  69719
                   5 |           4.36 |  47110
                   6 |           4.43 |  43070
(6 rows)

Time: 5530.120 ms

A little faster than the json_fdw version, but comparable. But takes well over twice as long as the hstore version, though. For queries, hstore is the clear winner. Yes, you pay up-front for loading and storage, but the payoff at query time is substantial. Ideally, of course, we would have the insert and storage benefits of JSON and the query performance of hstore. There was talk last spring at PGCon of using the same representation for JSON and hstore; perhaps that can still come about.

Meanwhile, I expect to play with some other data sets over the next week; watch this spot for more!