Is PostgreSQL Your Next JSON Database?

TL;DR: Betteridge's law applies unless your JSON is fairly unchanging and needs to be queried a lot.

With the most recent version of PostgreSQL gaining ever more JSON capabilities, we've been asked if PostgreSQL could replace MongoDB as a JSON database. There's a short answer to that, but we'd prefer to show you. Ah, a question from the audience...

“Didn't PostgreSQL already have some JSON support?”

Yes, it did. Before PostgreSQL 9.4 there was the JSON data type, introduced back in 9.2, and that's still available. It lets you do this:

>CREATE TABLE justjson ( id INTEGER, doc JSON);
>INSERT INTO justjson VALUES ( 1, '{
    "name":"fred",
    "address":{
        "line1":"52 The Elms",
        "line2":"Elmstreet",
        "postcode":"ES1 1ES"
        }
    }');

That stored the raw text of the JSON data in the database, complete with white space, retaining the order of keys and keeping any duplicate keys. Let's show that by looking at the data:

>SELECT * FROM justjson;
 id |               doc
----+---------------------------------
  1 | {                              +
    |     "name":"fred",             +
    |     "address":{                +
    |         "line1":"52 The Elms", +
    |         "line2":"Elmstreet",   +
    |         "postcode":"ES1 1ES"   +
    |         }                      +
    |     }
(1 row)

It has stored an exact copy of the source data. But we can still extract data from it, using a set of JSON operators that let us refer to elements within the JSON document. So, say we just want the address section, we can do:

select doc->>'address' FROM justjson;  
            ?column?
---------------------------------
 {                              +
         "line1":"52 The Elms", +
         "line2":"Elmstreet",   +
         "postcode":"ES1 1ES"   +
         }
(1 row)

The ->> operator says: within doc, look up the value with the following field name and return it as text. Given a number instead, it would have treated it as an array index, but still returned the value as text. There's also -> to go with ->>, which doesn't do that conversion to text; it returns JSON. We need that so we can navigate into nested JSON objects, like so:

select doc->'address'->>'postcode' FROM justjson;  
 ?column?
----------
 ES1 1ES
(1 row)
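Given a number instead of a field name, the same operators index into a JSON array (zero-based), so with ->> on a small literal array:

select '["a","b","c"]'::json->>1;
 ?column?
----------
 b
(1 row)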

There is also a shorter form, where we can specify a path to the data we're after using #>> and an array of keys, like this:

select doc#>>'{address,postcode}' FROM justjson;  
 ?column?
----------
 ES1 1ES
(1 row)

By preserving the entire document, the JSON data type makes it easy to work with exact copies of JSON documents and pass them on without loss. But with that exactness comes a cost: a loss of efficiency, because the text has to be re-parsed on every access, and with that comes an inability to index the type directly. So although it's convenient for preserving and parsing JSON documents, there was still plenty of room for improvement, and that's where JSONB comes in.

“What's different in JSONB?”

Well, JSONB turns the JSON document into a hierarchy of key/value pairs. All the white space is discarded, only the last value in a set of duplicate keys is kept, and the original order of keys is lost in favour of JSONB's own internal storage order. Let's make a JSONB version of the table we just created, insert some data and look at it:

>CREATE TABLE justjsonb ( id INTEGER, doc JSONB);
>INSERT INTO justjsonb VALUES ( 1, '{
    "name":"fred",
    "address":{
        "line1":"52 The Elms",
        "line2":"Elmstreet",
        "postcode":"ES1 1ES"
        }
    }');
>SELECT * FROM justjsonb;
 id |                                                doc
----+----------------------------------------------------------------------------------------------------
  1 | {"name": "fred", "address": {"line1": "52 The Elms", "line2": "Elmstreet", "postcode": "ES1 1ES"}}
(1 row)

We can see that all the textyness of the data has gone, replaced with the bare minimum needed to represent the data held within the JSON document. This stripping down means JSONB moves the parsing work to insertion time, relieving every later access of the job of parsing the data.
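A quick way to see the duplicate-key and ordering behaviour is to cast a literal; JSON would keep the text verbatim, while JSONB keeps only the last value of a duplicate key and stores keys in its own order:

select '{"b":1,"a":2,"b":3}'::jsonb;
      jsonb
------------------
 {"a": 2, "b": 3}
(1 row)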

“That looks a bit like HSTORE in PostgreSQL though”

Looked at as key/value pairs, the JSONB datatype does look a bit like the PostgreSQL HSTORE extension. That's a data type for storing key/value pairs too, but it's an extension, whereas JSONB (and JSON) are in core. HSTORE is also only one level deep, where JSON documents can have nested elements. And HSTORE stores only strings, while JSONB understands strings and the full range of JSON numbers.
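For comparison, a minimal HSTORE sketch (assuming the extension has been installed with CREATE EXTENSION); note that everything is flat and every value comes back as a string:

> CREATE EXTENSION IF NOT EXISTS hstore;
> select 'line1=>"52 The Elms", postcode=>"ES1 1ES"'::hstore -> 'postcode';
 ?column?
----------
 ES1 1ES
(1 row)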

“So what does JSONB actually get me?”

Indexing, indexing everywhere. You can't directly index a JSON datatype in PostgreSQL. You can make an index for it using expression indexes, but that only covers you for whatever you put in the expression. So, if we wanted to, we could do:

create index justjson_postcode on justjson ((doc->'address'->>'postcode'));  

And the postcode, and nothing else, would be indexed.
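A query that uses exactly that expression can then be answered from the index, something like this (whether the planner actually picks the index will depend on the table's size and statistics):

explain select * from justjson where doc->'address'->>'postcode' = 'ES1 1ES';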

With JSONB, there's support for GIN indexes; a Generalized Inverted Index. That gives you another set of query operators to work with: @> (contains this JSON), <@ (is contained by this JSON), ? (does this string exist as a key), ?| (do any of these strings exist as keys) and ?& (do all of these strings exist as keys).
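A few quick sketches of the existence operators against a literal (? takes a single string, while ?| and ?& take arrays):

select '{"name":"fred","age":42}'::jsonb ? 'name';                  -- t
select '{"name":"fred","age":42}'::jsonb ?| array['name','email'];  -- t
select '{"name":"fred","age":42}'::jsonb ?& array['name','email'];  -- f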

There are two kinds of GIN index you can create. The default, jsonb_ops, supports all these operators; an index using jsonb_path_ops supports only @>. The default creates an index entry for every key and value in the JSON, while jsonb_path_ops creates only a hash of each value and the keys leading up to it, which is a lot more compact and faster to process. The default, in turn, offers more operations at the cost of consuming more space. After adding some data to our table, we can run a query looking for a particular postcode. First, with no GIN index at all:

explain select * from justjsonb where doc @> '{ "address": { "postcode":"HA36CC" } }';  
                           QUERY PLAN
-----------------------------------------------------------------
 Seq Scan on justjsonb  (cost=0.00..3171.14 rows=100 width=123)
   Filter: (doc @> '{"address": {"postcode": "HA36CC"}}'::jsonb)
(2 rows)

We can see that it will sequentially scan the table. Now, if we create a default JSONB GIN index, we can see the difference it makes:

> create index justjsonb_gin on justjsonb using gin (doc);
> explain select * from justjsonb where doc @> '{ "address": { "postcode":"HA36CC" } }';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Bitmap Heap Scan on justjsonb  (cost=40.78..367.62 rows=100 width=123)
   Recheck Cond: (doc @> '{"address": {"postcode": "HA36CC"}}'::jsonb)
   ->  Bitmap Index Scan on justjsonb_gin  (cost=0.00..40.75 rows=100 width=0)
         Index Cond: (doc @> '{"address": {"postcode": "HA36CC"}}'::jsonb)
(4 rows)

The search is a lot more efficient, as you can tell by the lower cost. But the hidden cost is the size of the index; in this case it's 41% of the size of the data. Let's drop that index and repeat the process with a jsonb_path_ops GIN index.

> create index justjsonb_gin on justjsonb using gin (doc jsonb_path_ops);
> explain select * from justjsonb where doc @> '{ "address": { "postcode":"HA36CC" } }';
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Bitmap Heap Scan on justjsonb  (cost=16.78..343.62 rows=100 width=123)
   Recheck Cond: (doc @> '{"address": {"postcode": "HA36CC"}}'::jsonb)
   ->  Bitmap Index Scan on justjsonb_gin  (cost=0.00..16.75 rows=100 width=0)
         Index Cond: (doc @> '{"address": {"postcode": "HA36CC"}}'::jsonb)
(4 rows)

The total cost is slightly lower, and typically the index should be a lot smaller too. It's the classic task of balancing speed against size for indexes, but either way it's far more efficient than sequentially scanning.
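To compare for yourself, you can ask PostgreSQL for the table and index sizes; the exact numbers will, of course, depend on your data:

select pg_size_pretty(pg_relation_size('justjsonb')) as table_size,
       pg_size_pretty(pg_relation_size('justjsonb_gin')) as index_size;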

“Should I be using this as my JSON database?”

If you update your JSON documents in place, the answer is no. What PostgreSQL is very good at is storing and retrieving JSON documents and their fields. But even though you can individually address the various fields within a JSON document, you can't update a single field in place. Well, actually you can, but only by extracting the entire JSON document, appending the new values and writing it back, letting the JSON parser discard the duplicate keys. It's likely that you aren't going to want to rely on that.
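In practice, then, an update is a whole-document write: read the document in your application, change it there, and write it all back, something like this sketch (with a hypothetical changed address):

> UPDATE justjsonb
    SET doc = '{"name":"fred","address":{"line1":"1 The Oaks","line2":"Elmstreet","postcode":"ES1 1ES"}}'
  WHERE id = 1;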

If your active data sits comfortably in a relational schema and the JSON content is a companion to that data, then you should be fine with PostgreSQL, its much more efficient JSONB representation, and its indexing capabilities. If, though, your data model is a collection of mutable documents, then you probably want to look at a database engineered primarily around JSON documents, like MongoDB or RethinkDB.