aboutsummaryrefslogtreecommitdiff
path: root/doc/src/sgml/ref/create_statistics.sgml
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src/sgml/ref/create_statistics.sgml')
-rw-r--r--doc/src/sgml/ref/create_statistics.sgml116
1 files changed, 111 insertions, 5 deletions
diff --git a/doc/src/sgml/ref/create_statistics.sgml b/doc/src/sgml/ref/create_statistics.sgml
index 4363be50c3c..988f4c573ff 100644
--- a/doc/src/sgml/ref/create_statistics.sgml
+++ b/doc/src/sgml/ref/create_statistics.sgml
@@ -22,8 +22,12 @@ PostgreSQL documentation
<refsynopsisdiv>
<synopsis>
CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
+ ON ( <replaceable class="parameter">expression</replaceable> )
+ FROM <replaceable class="parameter">table_name</replaceable>
+
+CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_name</replaceable>
[ ( <replaceable class="parameter">statistics_kind</replaceable> [, ... ] ) ]
- ON <replaceable class="parameter">column_name</replaceable>, <replaceable class="parameter">column_name</replaceable> [, ...]
+ ON { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) }, { <replaceable class="parameter">column_name</replaceable> | ( <replaceable class="parameter">expression</replaceable> ) } [, ...]
FROM <replaceable class="parameter">table_name</replaceable>
</synopsis>
@@ -40,6 +44,19 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
</para>
<para>
+ The <command>CREATE STATISTICS</command> command has two basic forms. The
+ first form allows univariate statistics for a single expression to be
+ collected, providing benefits similar to an expression index without the
+ overhead of index maintenance. This form does not allow the statistics
+ kind to be specified, since the various statistics kinds refer only to
+ multivariate statistics. The second form of the command allows
+ multivariate statistics on multiple columns and/or expressions to be
+ collected, optionally specifying which statistics kinds to include. This
+ form will also automatically cause univariate statistics to be collected on
+ any expressions included in the list.
+ </para>
+
+ <para>
If a schema name is given (for example, <literal>CREATE STATISTICS
myschema.mystat ...</literal>) then the statistics object is created in the
specified schema. Otherwise it is created in the current schema.
@@ -79,14 +96,16 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
<term><replaceable class="parameter">statistics_kind</replaceable></term>
<listitem>
<para>
- A statistics kind to be computed in this statistics object.
+ A multivariate statistics kind to be computed in this statistics object.
Currently supported kinds are
<literal>ndistinct</literal>, which enables n-distinct statistics,
<literal>dependencies</literal>, which enables functional
dependency statistics, and <literal>mcv</literal> which enables
most-common values lists.
If this clause is omitted, all supported statistics kinds are
- included in the statistics object.
+ included in the statistics object. Univariate expression statistics are
+ built automatically if the statistics definition includes any complex
+ expressions rather than just simple column references.
For more information, see <xref linkend="planner-stats-extended"/>
and <xref linkend="multivariate-statistics-examples"/>.
</para>
@@ -98,8 +117,22 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
<listitem>
<para>
The name of a table column to be covered by the computed statistics.
- At least two column names must be given; the order of the column names
- is insignificant.
+ This is only allowed when building multivariate statistics. At least
+ two column names or expressions must be specified, and their order is
+ not significant.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term><replaceable class="parameter">expression</replaceable></term>
+ <listitem>
+ <para>
+ An expression to be covered by the computed statistics. This may be
+ used to build univariate statistics on a single expression, or as part
+ of a list of multiple column names and/or expressions to build
+ multivariate statistics. In the latter case, separate univariate
+ statistics are built automatically for each expression in the list.
</para>
</listitem>
</varlistentry>
@@ -125,6 +158,13 @@ CREATE STATISTICS [ IF NOT EXISTS ] <replaceable class="parameter">statistics_na
reading it. Once created, however, the ownership of the statistics
object is independent of the underlying table(s).
</para>
+
+ <para>
+ Expression statistics are per-expression and are similar to creating an
+ index on the expression, except that they avoid the overhead of index
+ maintenance. Expression statistics are built automatically for each
+ expression in the statistics object definition.
+ </para>
</refsect1>
<refsect1 id="sql-createstatistics-examples">
@@ -196,6 +236,72 @@ EXPLAIN ANALYZE SELECT * FROM t2 WHERE (a = 1) AND (b = 2);
in the table, allowing it to generate better estimates in both cases.
</para>
+ <para>
+ Create table <structname>t3</structname> with a single timestamp column,
+ and run queries using expressions on that column. Without extended
+ statistics, the planner has no information about the data distribution for
+ the expressions, and uses default estimates. The planner also does not
+ realize that the value of the date truncated to the month is fully
+ determined by the value of the date truncated to the day. Then expression
+ and ndistinct statistics are built on those two expressions:
+
+<programlisting>
+CREATE TABLE t3 (
+ a timestamp
+);
+
+INSERT INTO t3 SELECT i FROM generate_series('2020-01-01'::timestamp,
+ '2020-12-31'::timestamp,
+ '1 minute'::interval) s(i);
+
+ANALYZE t3;
+
+-- the number of matching rows will be drastically underestimated:
+EXPLAIN ANALYZE SELECT * FROM t3
+ WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
+
+EXPLAIN ANALYZE SELECT * FROM t3
+ WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
+ AND '2020-06-30'::timestamp;
+
+EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
+ FROM t3 GROUP BY 1, 2;
+
+-- build ndistinct statistics on the pair of expressions (per-expression
+-- statistics are built automatically)
+CREATE STATISTICS s3 (ndistinct) ON date_trunc('month', a), date_trunc('day', a) FROM t3;
+
+ANALYZE t3;
+
+-- now the row count estimates are more accurate:
+EXPLAIN ANALYZE SELECT * FROM t3
+ WHERE date_trunc('month', a) = '2020-01-01'::timestamp;
+
+EXPLAIN ANALYZE SELECT * FROM t3
+ WHERE date_trunc('day', a) BETWEEN '2020-01-01'::timestamp
+ AND '2020-06-30'::timestamp;
+
+EXPLAIN ANALYZE SELECT date_trunc('month', a), date_trunc('day', a)
+ FROM t3 GROUP BY 1, 2;
+</programlisting>
+
+ Without expression and ndistinct statistics, the planner has no information
+ about the number of distinct values for the expressions, and has to rely
+ on default estimates. The equality and range conditions are assumed to have
+ 0.5% selectivity, and the number of distinct values in the expression is
+ assumed to be the same as for the column (i.e. unique). This results in a
+ significant underestimate of the row count in the first two queries. Moreover,
+ the planner has no information about the relationship between the expressions,
+ so it assumes the two <literal>WHERE</literal> and <literal>GROUP BY</literal>
+ conditions are independent, and multiplies their selectivities together to
+ arrive at a severe overestimate of the group count in the aggregate query.
+ This is further exacerbated by the lack of accurate statistics for the
+ expressions, forcing the planner to use a default ndistinct estimate for the
+ expression derived from ndistinct for the column. With such statistics, the
+ planner recognizes that the conditions are correlated, and arrives at much
+ more accurate estimates.
+ </para>
+
</refsect1>
<refsect1>