Thoughts About Jsonb Statistics - Alexander Korotkov's blog

Introduction

Users of jsonb datatype frequently complaint that it lucks of statistics. Naturally, today jsonb statistics is just default scalar statistics, which is suitable for <, <=, =, >=, > operators selectivity estimation. But people search jsonb documents using @> operator, expressions with -> operator, jsquery etc. This is why selectivity estimation, which people typically get in their queries, is just a stub. This could lead wrong query plans and bad performance. And it made us introduce hints in jsquery extension.

Thus, problem is clear. But the right solution is still unclear, at least for me. Let me discuss evident approaches to jsonb statistics and their limitations.

Collect just frequent paths

First candidate for good selectivity estimation is @> operator. Really, @> is builtin operator with GIN index support. First idea that comes into mind is to collect most frequent paths and their frequencies as jsonb statistics. In order to understand idea of paths better let’s consider how GIN jsonb_path_ops works. jsonb_path_ops is builtin GIN operator class, which is most suitable for jsonb @> operator.

Path is a sequence of key names, array indexes and referenced value. For instance, the document {"a": [{"b": "xyz", "c": true}, 10], "d": {"e": [7, false]}} would be decomposed into following set of paths.

"a".#."b"."xyz"
"a".#."c".true
"a".#.10
"d"."e".#.7
"d"."e".#.false

In this representation of paths array indexes are replaced with #. That allows our search to be agnostic to them like @> operator does. Thus, when we have such decomposition we can say that if a @> b then a paths are superset of b paths. If we intersect posting list of search argument paths then we can get list of candidates for search result. This is how jsonb_path_ops works.

The same idea could be applied to jsonb statistics. We could decompose each jsonb document into set of paths and then collect frequencies of most common individual paths. Such statistics perfectly fits current PostgreSQL system catalog and looks very similar to statistics of tsvectors and arrays, which are decomposed into lexemes and elements correspondingly. Such statistics of most common paths could look like following table.

Path	Frequency
“a”.#.”b”.”xyz”	0.55
“d”.”e”.#.77	0.43
“a”.#.”b”.”def”	0.35
“d”.”e”.#.100	0.22
“d”.”f”.true	0.1

Having such statistics we can estimate selectivity of @> operator as product of frequencies of search argument paths. For paths, which are not in most common list, we can use some default “rare frequency”. Also, we use quite rough assumption that paths appearance is independent. Let’s be honest: this assumption is just wrong. However, this is typical assumption we have to use during query planning. Finally, we don’t need absolutely accurate cost. Match of magnitude order can be considered as a quite good result.

There is also another source or inaccuracy I’d like to mention. Let’s consider some example.

a = [{"x": [1]}, {"x": [2]}]
b = [{"x": [1,2]}]

Both a and b are decomposed into the same set of paths.

#."x".1
#."x".2

However, neither a @> b neither ‘b @> a’. Since we ignored array indexes in paths we also ignore whether values beholds to same array element or not. This leads also to false positives in GIN and overestimations by statistics.

This approach is not only limited by @> operator. We can produce estimation for queries with complex logic. Example in jsquery could be "(abc" = 1 OR "xyz".# = "hij") AND NOT "def" = false.

However, such statistics hardly can estimate selectivity of <, <=, >=, > operators over jsonb values. For instance, in order to estimate jsquery "x" > 1 we can only count most common paths, which match this condition. But we’re lacking of histograms. It is a serious obstacle in getting accurate estimates and it lets us search for better solution.

Collect scalar statistics for each key path

Another idea of jsonb statistics we can get from assumption that almost every “schemaless” dataset can be easily represented in the schema of tables. Assuming this we would like our selectivity estimates for search in jsonb documents to be as good as those for search in plain tables.

Let’s consider this on the example. The following json document could represent the information about order in e-commerce.

{
  "id": 1,
  "contact": "John Smith",
  "phone": "212 555-1234",
  "address": "10021-3100, 21 2nd Street, New York",
  "products":
  [
    {
      "article": "XF56120",
      "name": "Sunglasses",
      "price": 500,
      "quantity": 1
    },
    {
      "article": "AT10789",
      "name": "T-Shirt",
      "price": 100,
      "quantity": 2
    }
  ]
}

The same information could be represented in the following couple of tables.

id	contact	phone	address
1	John Smith	212 555-1234	10021-3100, 21 2nd Street, New York

order_id	article	name	price	quantity
1	XF56120	Sunglasses	500	1
1	AT10789	T-Shirt	100	2

What kind of statictis would be collected by PostgreSQL in the second case? It would be most common values and histogram for each attribute. Most common values (MCVs) are values, which occur in the column most frequently. Frequencies of those values are collected and stored as well. Histogram is described by array of bounds. Each bound is assumed to contain equal number of column values excluding MCVs (so called equi-depth histogram).

With some simplification such statistics could be represented in the following table.

Table	Attribute	Most common values	Histogram
order	contact	{“John Smith”: 0.05, “James Johnson”: 0.01}	[“Anthony Anderson”, “Lisa Baker”, “Sandra Phillips”]
product	price	{“100”: 0.1, “10”: 0.08, “50”: 0.05, “150”: 0.03}	[0, 12.5, 45.5, 250, 1000]
product	quantity	{“1”: 0.5, “2”: 0.2, “3”: 0.05, “5”: 0.01}	[0, 4, 7, 9, 10]
…….	………	………………………………………….	……………………………………………..

What if we replace table and attribute with path of keys where corresponding value could be found in json document?

Key path	Most common values	Histogram
contact	{“John Smith”: 0.05, “James Johnson”: 0.01}	[“Anthony Anderson”, “Lisa Baker”, “Sandra Phillips”]
products.#.price	{“100”: 0.1, “10”: 0.08, “50”: 0.05, “150”: 0.03}	[0, 12.5, 45.5, 250, 1000]
products.#.quantity	{“1”: 0.5, “2”: 0.2, “3”: 0.05, “5”: 0.01}	[0, 4, 7, 9, 10]
……………….	………………………………………….	……………………………………………..

This kind of statistics seems to be comprehensive enough. It could produce fine estimations for queries like products.#.price > 100.

However, there are still bunch of open problems here.

Typical json documents we can meet in applications are really well structured as an example above. However, there are some cases when they are not. At first, someone could easily put values into keys. Let me illustrate this on the following example: products becomes an object where article is used as a key.

In this case we can find that cardinality of key paths are very high. Thus, we would be unable to collect suitable statistics for each key path. However, we could consider such situation as user mistake. Then we should advise users to restructure their documents.

There are still kind of documents, which don’t fit this model not because of user mistake but because of their nature. Imagine json formatted query plans stored in the table. Plans could have unlimited levels of nesting and correspondingly cardinality of key paths could be very high.

{
  "id": 1,
  "contact": "John Smith",
  "phone": "212 555-1234",
  "address": "10021-3100, 21 2nd Street, New York",
  "products":
  {
    "XF56120":
    {
      "name": "Sunglasses",
      "price": 500,
      "quantity": 1
    },
    "AT10789":
    {
      "name": "T-Shirt",
      "price": 100,
      "quantity": 2
    }
  }
}

Some objects stored inside jsonb documents could require special statistics. For instance, point coordinates could be represented in json as {"x": 11.3, "y": 27.0}. But statistics we will need in this case is not separate statistics for x and y. We would need something special for geometrical objects like 2D-histograms.
Another problem is fitting this model into PostgreSQL system catalog. pg_statistic assumes that statistics of attribute is represented by few arrays. However, in this model we have to store few arrays per each key path. For sure, we do a trick by storing array of jsonb or something like this, but that would be a kluge. It would be nice to store each key path in the separate row of pg_statistic. This would require significant changes in statistics handling though.

Conclusion

This was just my current thoughts about jsonb statistics. Probably, someone come with much better ideas. But I’m not sure we can find ideal solution, which would fit everyone needs. We can see that current developments in multivariate statistics use pluggable approach: user can turn on specific method on specific set of column. We could end up with something similar for jsonb: simple basic statistics + various kinds of pluggable statistics for specific needs.

Introduction

Collect just frequent paths

Collect scalar statistics for each key path

Conclusion

Comments