Thoughts About Jsonb Statistics
Introduction
Users of the jsonb datatype frequently complain that it lacks statistics. Today, jsonb statistics are just the default scalar statistics, which are suitable for selectivity estimation of the `<`, `<=`, `=`, `>=`, `>` operators. But people search jsonb documents using the `@>` operator, expressions with the `->` operator, jsquery, etc. This is why the selectivity estimation people typically get for their queries is just a stub. This can lead to wrong query plans and bad performance, and it is what made us introduce hints in the jsquery extension.
Thus, the problem is clear, but the right solution is still unclear, at least for me. Let me discuss the evident approaches to jsonb statistics and their limitations.
Collect just frequent paths
The first candidate for good selectivity estimation is the `@>` operator. Indeed, `@>` is a builtin operator with GIN index support. The first idea that comes to mind is to collect the most frequent paths and their frequencies as jsonb statistics. To understand the idea of paths better, let's consider how GIN jsonb_path_ops works. jsonb_path_ops is the builtin GIN operator class that is most suitable for the jsonb `@>` operator.
A path is a sequence of key names and array indexes terminated by the referenced value. For instance, the document
`{"a": [{"b": "xyz", "c": true}, 10], "d": {"e": [7, false]}}`
would be decomposed into the following set of paths.
```
"a".#."b"."xyz"
"a".#."c".true
"a".#.10
"d"."e".#.7
"d"."e".#.false
```
In this representation of paths, array indexes are replaced with `#`. That allows searches to be agnostic to them, just like the `@>` operator is. Thus, given such a decomposition, we can say that if `a @> b`, then the paths of `a` are a superset of the paths of `b`. If we intersect the posting lists of the search argument's paths, we get a list of candidates for the search result. This is how jsonb_path_ops works.
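To make the decomposition concrete, here is a toy Python sketch (an illustration, not PostgreSQL's actual implementation) that decomposes a document into jsonb_path_ops-style paths and uses the superset property as a containment prefilter:

```python
import json

def decompose(doc, prefix=()):
    """Decompose a JSON document into the set of paths used by
    jsonb_path_ops: key names plus a terminal value, with array
    indexes replaced by the placeholder '#'."""
    paths = set()
    if isinstance(doc, dict):
        for key, value in doc.items():
            paths |= decompose(value, prefix + (key,))
    elif isinstance(doc, list):
        for item in doc:
            paths |= decompose(item, prefix + ('#',))
    else:
        # A scalar value terminates the path.
        paths.add(prefix + (json.dumps(doc),))
    return paths

a = {"a": [{"b": "xyz", "c": True}, 10], "d": {"e": [7, False]}}
b = {"a": [{"b": "xyz"}]}

# If a @> b holds, the paths of a must be a superset of the paths of b,
# so b's document is a candidate match for containment in a.
print(decompose(a) >= decompose(b))  # True
```

The superset check is only a prefilter: as discussed below, it can produce false positives, so a recheck on the actual documents is still required.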
The same idea could be applied to jsonb statistics. We could decompose each jsonb document into a set of paths and then collect the frequencies of the most common individual paths. Such statistics fit the current PostgreSQL system catalog perfectly and look very similar to the statistics for tsvectors and arrays, which are decomposed into lexemes and elements correspondingly. Statistics of the most common paths could look like the following table.
| Path              | Frequency |
|-------------------|-----------|
| "a".#."b"."xyz"   | 0.55      |
| "d"."e".#.77      | 0.43      |
| "a".#."b"."def"   | 0.35      |
| "d"."e".#.100     | 0.22      |
| "d"."f".true      | 0.1       |
Having such statistics, we can estimate the selectivity of the `@>` operator as the product of the frequencies of the search argument's paths. For paths that are not in the most common list, we can use some default "rare frequency". Also, we use the quite rough assumption that path occurrences are independent. Let's be honest: this assumption is just wrong. However, it is the typical kind of assumption we have to make during query planning. Finally, we don't need an absolutely accurate cost: getting the right order of magnitude can be considered quite a good result.
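The estimation rule can be sketched in a few lines of Python; the most-common-paths table mirrors the one above, and the default "rare frequency" value is made up for illustration:

```python
# Hypothetical most-common-paths statistics: path -> frequency.
MCP = {
    '"a".#."b"."xyz"': 0.55,
    '"d"."e".#.77':    0.43,
    '"a".#."b"."def"': 0.35,
    '"d"."e".#.100':   0.22,
    '"d"."f".true':    0.1,
}

DEFAULT_FREQ = 0.001  # assumed frequency for paths outside the MCP list

def containment_selectivity(arg_paths):
    """Estimate the selectivity of `column @> arg` as the product of
    the frequencies of the argument's paths (independence assumed)."""
    sel = 1.0
    for path in arg_paths:
        sel *= MCP.get(path, DEFAULT_FREQ)
    return sel

# {"a": [{"b": "xyz"}], "d": {"e": [77]}} decomposes into two paths:
est = containment_selectivity(['"a".#."b"."xyz"', '"d"."e".#.77'])
print(est)  # 0.55 * 0.43 ≈ 0.2365
```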
There is also another source of inaccuracy I'd like to mention. Let's consider an example.
```
a = [{"x": 1}, {"y": 2}]
b = [{"x": 1, "y": 2}]
```
Both `a` and `b` are decomposed into the same set of paths.
```
#."x".1
#."y".2
```
However, neither `a @> b` nor `b @> a` holds. Since we ignored array indexes in the paths, we also ignore whether values belong to the same array element or not. This leads to false positives in GIN and to overestimation by the statistics.
This approach is not limited to the `@>` operator. We can produce estimations for queries with complex logic. An example in jsquery could be `("abc" = 1 OR "xyz".# = "hij") AND NOT "def" = false`.
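Under the same independence assumption, per-clause selectivities could be combined with the standard probability formulas that planners conventionally use for boolean expressions. A sketch, with made-up per-clause estimates:

```python
def sel_and(p1, p2):
    # Independence assumption: P(A and B) = P(A) * P(B)
    return p1 * p2

def sel_or(p1, p2):
    # P(A or B) = P(A) + P(B) - P(A) * P(B)
    return p1 + p2 - p1 * p2

def sel_not(p):
    # P(not A) = 1 - P(A)
    return 1.0 - p

# ("abc" = 1 OR "xyz".# = "hij") AND NOT "def" = false,
# assuming per-clause estimates of 0.2, 0.1, and 0.4:
est = sel_and(sel_or(0.2, 0.1), sel_not(0.4))
print(est)  # (0.2 + 0.1 - 0.02) * 0.6 ≈ 0.168
```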
However, such statistics can hardly estimate the selectivity of the `<`, `<=`, `>=`, `>` operators over jsonb values. For instance, in order to estimate the jsquery `"x" > 1`, we can only count the most common paths that match this condition. But we lack histograms. This is a serious obstacle to getting accurate estimates, and it makes us search for a better solution.
Collect scalar statistics for each key path
Another idea for jsonb statistics comes from the observation that almost every "schemaless" dataset can easily be represented as a schema of tables. Assuming this, we would like our selectivity estimates for searches in jsonb documents to be as good as those for searches in plain tables.
Let's consider an example. The following json document could represent information about an order in e-commerce.
```
{
    "id": 1,
    "contact": "John Smith",
    "phone": "212 5551234",
    "address": "100213100, 21 2nd Street, New York",
    "products":
    [
        {
            "article": "XF56120",
            "name": "Sunglasses",
            "price": 500,
            "quantity": 1
        },
        {
            "article": "AT10789",
            "name": "TShirt",
            "price": 100,
            "quantity": 2
        }
    ]
}
```
The same information could be represented in the following couple of tables.
| id | contact    | phone       | address                            |
|----|------------|-------------|------------------------------------|
| 1  | John Smith | 212 5551234 | 100213100, 21 2nd Street, New York |

| order_id | article | name       | price | quantity |
|----------|---------|------------|-------|----------|
| 1        | XF56120 | Sunglasses | 500   | 1        |
| 1        | AT10789 | TShirt     | 100   | 2        |
What kind of statistics would PostgreSQL collect in the second case? It would be most common values and a histogram for each attribute. Most common values (MCVs) are the values that occur in a column most frequently; their frequencies are collected and stored as well. A histogram is described by an array of bounds, where each bin between adjacent bounds is assumed to contain an equal number of the column's values, excluding MCVs (a so-called equi-depth histogram).
With some simplification such statistics could be represented in the following table.
| Table   | Attribute | Most common values                                | Histogram                                          |
|---------|-----------|---------------------------------------------------|----------------------------------------------------|
| order   | contact   | {"John Smith": 0.05, "James Johnson": 0.01}       | ["Anthony Anderson", "Lisa Baker", "Sandra Phillips"] |
| product | price     | {"100": 0.1, "10": 0.08, "50": 0.05, "150": 0.03} | [0, 12.5, 45.5, 250, 1000]                         |
| product | quantity  | {"1": 0.5, "2": 0.2, "3": 0.05, "5": 0.01}        | [0, 4, 7, 9, 10]                                   |
| …       | …         | …                                                 | …                                                  |
What if we replace the table and attribute with the path of keys at which the corresponding value can be found in the json document?
| Key path            | Most common values                                | Histogram                                          |
|---------------------|---------------------------------------------------|----------------------------------------------------|
| contact             | {"John Smith": 0.05, "James Johnson": 0.01}       | ["Anthony Anderson", "Lisa Baker", "Sandra Phillips"] |
| products.#.price    | {"100": 0.1, "10": 0.08, "50": 0.05, "150": 0.03} | [0, 12.5, 45.5, 250, 1000]                         |
| products.#.quantity | {"1": 0.5, "2": 0.2, "3": 0.05, "5": 0.01}        | [0, 4, 7, 9, 10]                                   |
| …                   | …                                                 | …                                                  |
This kind of statistics seems comprehensive enough. It could produce fine estimations for queries like `products.#.price > 100`.
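Such an estimate could combine the per-path MCVs with the equi-depth histogram, much as PostgreSQL's scalar inequality estimator does for ordinary columns. A simplified Python sketch (the statistics are the illustrative numbers from the table above; interpolation inside the boundary bin is omitted):

```python
import bisect

# Hypothetical per-key-path statistics, mirroring the table above.
STATS = {
    'products.#.price': {
        'mcv': {100: 0.1, 10: 0.08, 50: 0.05, 150: 0.03},
        'histogram': [0, 12.5, 45.5, 250, 1000],  # equi-depth bounds
    },
}

def gt_selectivity(path, const):
    """Estimate selectivity of `path > const`: MCV frequencies above
    the constant, plus the share of equi-depth histogram bins lying
    entirely above it, scaled by the non-MCV fraction."""
    st = STATS[path]
    mcv_total = sum(st['mcv'].values())
    mcv_sel = sum(f for v, f in st['mcv'].items() if v > const)
    bounds = st['histogram']
    nbins = len(bounds) - 1
    # Bins whose lower bound is >= const lie entirely above it.
    bins_above = max(0, nbins - bisect.bisect_left(bounds, const))
    hist_sel = bins_above / nbins * (1.0 - mcv_total)
    return mcv_sel + hist_sel

# MCV part: only 150 exceeds 100, contributing 0.03.  Histogram part:
# one of four bins (250..1000) is entirely above 100, contributing
# 1/4 of the non-MCV fraction 0.74, i.e. 0.185.
print(gt_selectivity('products.#.price', 100))  # ≈ 0.215
```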
However, there is still a bunch of open problems here.

Typical json documents found in applications are really well structured, like the example above. However, there are cases when they are not. First, someone could easily put values into keys. Let me illustrate this with the following example, where `products` becomes an object in which the article is used as a key.
```
{
    "id": 1,
    "contact": "John Smith",
    "phone": "212 5551234",
    "address": "100213100, 21 2nd Street, New York",
    "products":
    {
        "XF56120":
        {
            "name": "Sunglasses",
            "price": 500,
            "quantity": 1
        },
        "AT10789":
        {
            "name": "TShirt",
            "price": 100,
            "quantity": 2
        }
    }
}
```
In this case, we can find that the cardinality of key paths is very high, so we would be unable to collect suitable statistics for each key path. However, we could consider such a situation as a user mistake and advise users to restructure their documents.
There are still kinds of documents that don't fit this model, not because of user mistakes but because of their nature. Imagine json-formatted query plans stored in a table. Plans could have unlimited levels of nesting, and correspondingly the cardinality of key paths could be very high.

Some objects stored inside jsonb documents could require special statistics. For instance, point coordinates could be represented in json as `{"x": 11.3, "y": 27.0}`. But the statistics we need in this case are not separate statistics for `x` and `y`; we would need something special for geometrical objects, like 2D histograms.
Another problem is fitting this model into the PostgreSQL system catalog. `pg_statistic` assumes that the statistics for an attribute are represented by a few arrays. However, in this model we have to store a few arrays per key path. For sure, we could play a trick by storing an array of jsonb or something like that, but it would be a kludge. It would be nice to store each key path in a separate row of `pg_statistic`, but this would require significant changes in statistics handling.
Conclusion
These were just my current thoughts about jsonb statistics. Probably someone will come up with much better ideas. But I'm not sure we can find an ideal solution that fits everyone's needs. We can see that current developments in multivariate statistics use a pluggable approach: the user can turn on a specific method for a specific set of columns. We could end up with something similar for jsonb: simple basic statistics plus various kinds of pluggable statistics for specific needs.