aboutsummaryrefslogtreecommitdiff
path: root/doc/src
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src')
-rw-r--r--doc/src/sgml/btree.sgml201
-rw-r--r--doc/src/sgml/charset.sgml9
-rw-r--r--doc/src/sgml/citext.sgml7
-rw-r--r--doc/src/sgml/func.sgml9
-rw-r--r--doc/src/sgml/ref/create_index.sgml44
5 files changed, 253 insertions, 17 deletions
diff --git a/doc/src/sgml/btree.sgml b/doc/src/sgml/btree.sgml
index fcf771c857f..f02e02b0acc 100644
--- a/doc/src/sgml/btree.sgml
+++ b/doc/src/sgml/btree.sgml
@@ -557,11 +557,208 @@ equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
<sect1 id="btree-implementation">
<title>Implementation</title>
+ <para>
+ This section covers B-Tree index implementation details that may be
+ of use to advanced users. See
+ <filename>src/backend/access/nbtree/README</filename> in the source
+ distribution for a much more detailed, internals-focused description
+ of the B-Tree implementation.
+ </para>
+ <sect2 id="btree-structure">
+ <title>B-Tree Structure</title>
+ <para>
+ <productname>PostgreSQL</productname> B-Tree indexes are
+ multi-level tree structures, where each level of the tree can be
+ used as a doubly-linked list of pages. A single metapage is stored
+ in a fixed position at the start of the first segment file of the
+ index. All other pages are either leaf pages or internal pages.
+ Leaf pages are the pages on the lowest level of the tree. All
+ other levels consist of internal pages. Each leaf page contains
+ tuples that point to table rows. Each internal page contains
+ tuples that point to the next level down in the tree. Typically,
+ over 99% of all pages are leaf pages. Both internal pages and leaf
+ pages use the standard page format described in <xref
+ linkend="storage-page-layout"/>.
+ </para>
+ <para>
+ New leaf pages are added to a B-Tree index when an existing leaf
+ page cannot fit an incoming tuple. A <firstterm>page
+ split</firstterm> operation makes room for items that originally
+ belonged on the overflowing page by moving a portion of the items
+ to a new page. Page splits must also insert a new
+ <firstterm>downlink</firstterm> to the new page in the parent page,
+ which may cause the parent to split in turn. Page splits
+ <quote>cascade upwards</quote> in a recursive fashion. When the
+ root page finally cannot fit a new downlink, a <firstterm>root page
+ split</firstterm> operation takes place. This adds a new level to
+ the tree structure by creating a new root page that is one level
+ above the original root page.
+ </para>
+ </sect2>
+
+ <sect2 id="btree-deduplication">
+ <title>Deduplication</title>
+ <para>
+ A duplicate is a leaf page tuple (a tuple that points to a table
+ row) where <emphasis>all</emphasis> indexed key columns have values
+ that match corresponding column values from at least one other leaf
+ page tuple that's close by in the same index. Duplicate tuples are
+ quite common in practice. B-Tree indexes can use a special,
+ space-efficient representation for duplicates when an optional
+ technique is enabled: <firstterm>deduplication</firstterm>.
+ </para>
+ <para>
+ Deduplication works by periodically merging groups of duplicate
+ tuples together, forming a single posting list tuple for each
+ group. The column key value(s) only appear once in this
+ representation. This is followed by a sorted array of
+ <acronym>TID</acronym>s that point to rows in the table. This
+ significantly reduces the storage size of indexes where each value
+ (or each distinct combination of column values) appears several
+ times on average. The latency of queries can be reduced
+ significantly. Overall query throughput may increase
+ significantly. The overhead of routine index vacuuming may also be
+ reduced significantly.
+ </para>
+ <note>
+ <para>
+ While NULL is generally not considered to be equal to any other
+ value, including NULL, NULL is nevertheless treated as just
+ another value from the domain of indexed values by the B-Tree
+ implementation (except when enforcing uniqueness in a unique
+ index). B-Tree deduplication is therefore just as effective with
+ <quote>duplicates</quote> that contain a NULL value.
+ </para>
+ </note>
+ <para>
+ The deduplication process occurs lazily, when a new item is
+ inserted that cannot fit on an existing leaf page. This prevents
+ (or at least delays) leaf page splits. Unlike GIN posting list
+ tuples, B-Tree posting list tuples do not need to expand every time
+ a new duplicate is inserted; they are merely an alternative
+ physical representation of the original logical contents of the
+ leaf page. This design prioritizes consistent performance with
+ mixed read-write workloads. Most client applications will at least
+ see a moderate performance benefit from using deduplication.
+ Deduplication is enabled by default.
+ </para>
+ <para>
+ Write-heavy workloads that don't benefit from deduplication due to
+ having few or no duplicate values in indexes will incur a small,
+ fixed performance penalty (unless deduplication is explicitly
+ disabled). The <literal>deduplicate_items</literal> storage
+ parameter can be used to disable deduplication within individual
+ indexes. There is never any performance penalty with read-only
+ workloads, since reading posting list tuples is at least as
+ efficient as reading the standard tuple representation. Disabling
+ deduplication isn't usually helpful.
+ </para>
+ <para>
+ B-Tree indexes are not directly aware that under MVCC, there might
+ be multiple extant versions of the same logical table row; to an
+ index, each tuple is an independent object that needs its own index
+ entry. Thus, an update of a row always creates all-new index
+ entries for the row, even if the key values did not change. Some
+ workloads suffer from index bloat caused by these
+ implementation-level version duplicates (this is typically a
+ problem for <command>UPDATE</command>-heavy workloads that cannot
+ apply the <acronym>HOT</acronym> optimization due to modifying at
+ least one indexed column). B-Tree deduplication does not
+ distinguish between these implementation-level version duplicates
+ and conventional duplicates. Deduplication can nevertheless help
+ with controlling index bloat caused by implementation-level version
+ churn.
+ </para>
+ <tip>
+ <para>
+ A special heuristic is applied to determine whether a
+ deduplication pass in a unique index should take place. It can
+ often skip straight to splitting a leaf page, avoiding a
+ performance penalty from wasting cycles on unhelpful deduplication
+ passes. If you're concerned about the overhead of deduplication,
+ consider setting <literal>deduplicate_items = off</literal>
+ selectively. Leaving deduplication enabled in unique indexes has
+ little downside.
+ </para>
+ </tip>
+ <para>
+ Deduplication cannot be used in all cases due to
+ implementation-level restrictions. Deduplication safety is
+ determined when <command>CREATE INDEX</command> or
+ <command>REINDEX</command> run.
+ </para>
+ <para>
+ Note that deduplication is deemed unsafe and cannot be used in the
+ following cases involving semantically significant differences
+ among equal datums:
+ </para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ <type>text</type>, <type>varchar</type>, and <type>char</type>
+ cannot use deduplication when a
+ <emphasis>nondeterministic</emphasis> collation is used. Case
+ and accent differences must be preserved among equal datums.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>numeric</type> cannot use deduplication. Numeric display
+ scale must be preserved among equal datums.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>jsonb</type> cannot use deduplication, since the
+ <type>jsonb</type> B-Tree operator class uses
+ <type>numeric</type> internally.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>
+ <type>float4</type> and <type>float8</type> cannot use
+ deduplication. These types have distinct representations for
+ <literal>-0</literal> and <literal>0</literal>, which are
+ nevertheless considered equal. This difference must be
+ preserved.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ There is one further implementation-level restriction that may be
+ lifted in a future version of
+ <productname>PostgreSQL</productname>:
+ </para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Container types (such as composite types, arrays, or range
+ types) cannot use deduplication.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ There is one further implementation-level restriction that applies
+ regardless of the operator class or collation used:
+ </para>
<para>
- An introduction to the btree index implementation can be found in
- <filename>src/backend/access/nbtree/README</filename>.
+ <itemizedlist>
+ <listitem>
+ <para>
+ <literal>INCLUDE</literal> indexes can never use deduplication.
+ </para>
+ </listitem>
+ </itemizedlist>
</para>
+ </sect2>
</sect1>
</chapter>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 057a6bb81a5..20cdfabd7bf 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -928,10 +928,11 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
nondeterministic collations give a more <quote>correct</quote> behavior,
especially when considering the full power of Unicode and its many
special cases, they also have some drawbacks. Foremost, their use leads
- to a performance penalty. Also, certain operations are not possible with
- nondeterministic collations, such as pattern matching operations.
- Therefore, they should be used only in cases where they are specifically
- wanted.
+ to a performance penalty. Note, in particular, that B-tree cannot use
+ deduplication with indexes that use a nondeterministic collation. Also,
+ certain operations are not possible with nondeterministic collations,
+ such as pattern matching operations. Therefore, they should be used
+ only in cases where they are specifically wanted.
</para>
</sect3>
</sect2>
diff --git a/doc/src/sgml/citext.sgml b/doc/src/sgml/citext.sgml
index 667824fb0b8..59866013271 100644
--- a/doc/src/sgml/citext.sgml
+++ b/doc/src/sgml/citext.sgml
@@ -233,9 +233,10 @@ SELECT * FROM users WHERE nick = 'Larry';
<para>
<type>citext</type> is not as efficient as <type>text</type> because the
operator functions and the B-tree comparison functions must make copies
- of the data and convert it to lower case for comparisons. It is,
- however, slightly more efficient than using <function>lower</function> to get
- case-insensitive matching.
+ of the data and convert it to lower case for comparisons. Also, only
+ <type>text</type> can support B-Tree deduplication. However,
+ <type>citext</type> is slightly more efficient than using
+ <function>lower</function> to get case-insensitive matching.
</para>
</listitem>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index ceda48e0fc3..28035f1635c 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -16561,10 +16561,11 @@ AND
rows. Two rows might have a different binary representation even
though comparisons of the two rows with the equality operator is true.
The ordering of rows under these comparison operators is deterministic
- but not otherwise meaningful. These operators are used internally for
- materialized views and might be useful for other specialized purposes
- such as replication but are not intended to be generally useful for
- writing queries.
+ but not otherwise meaningful. These operators are used internally
+ for materialized views and might be useful for other specialized
+ purposes such as replication and B-Tree deduplication (see <xref
+ linkend="btree-deduplication"/>). They are not intended to be
+ generally useful for writing queries, though.
</para>
</sect2>
</sect1>
diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index ab362a0dc52..a05e2e6b9ca 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -171,6 +171,8 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
maximum size allowed for the index type, data insertion will fail.
In any case, non-key columns duplicate data from the index's table
and bloat the size of the index, thus potentially slowing searches.
+ Furthermore, B-tree deduplication is never used with indexes
+ that have a non-key column.
</para>
<para>
@@ -393,10 +395,39 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
</variablelist>
<para>
- B-tree indexes additionally accept this parameter:
+ B-tree indexes also accept these parameters:
</para>
<variablelist>
+ <varlistentry id="index-reloption-deduplication" xreflabel="deduplicate_items">
+ <term><literal>deduplicate_items</literal>
+ <indexterm>
+ <primary><varname>deduplicate_items</varname></primary>
+ <secondary>storage parameter</secondary>
+ </indexterm>
+ </term>
+ <listitem>
+ <para>
+ Controls usage of the B-tree deduplication technique described
+ in <xref linkend="btree-deduplication"/>. Set to
+ <literal>ON</literal> or <literal>OFF</literal> to enable or
+ disable the optimization. (Alternative spellings of
+ <literal>ON</literal> and <literal>OFF</literal> are allowed as
+ described in <xref linkend="config-setting"/>.) The default is
+ <literal>ON</literal>.
+ </para>
+
+ <note>
+ <para>
+ Turning <literal>deduplicate_items</literal> off via
+ <command>ALTER INDEX</command> prevents future insertions from
+ triggering deduplication, but does not in itself make existing
+ posting list tuples use the standard tuple representation.
+ </para>
+ </note>
+ </listitem>
+ </varlistentry>
+
<varlistentry id="index-reloption-vacuum-cleanup-index-scale-factor" xreflabel="vacuum_cleanup_index_scale_factor">
<term><literal>vacuum_cleanup_index_scale_factor</literal>
<indexterm>
@@ -451,9 +482,7 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] <replaceable class=
This setting controls usage of the fast update technique described in
<xref linkend="gin-fast-update"/>. It is a Boolean parameter:
<literal>ON</literal> enables fast update, <literal>OFF</literal> disables it.
- (Alternative spellings of <literal>ON</literal> and <literal>OFF</literal> are
- allowed as described in <xref linkend="config-setting"/>.) The
- default is <literal>ON</literal>.
+ The default is <literal>ON</literal>.
</para>
<note>
@@ -806,6 +835,13 @@ CREATE UNIQUE INDEX title_idx ON films (title) INCLUDE (director, rating);
</para>
<para>
+ To create a B-Tree index with deduplication disabled:
+<programlisting>
+CREATE INDEX title_idx ON films (title) WITH (deduplicate_items = off);
+</programlisting>
+ </para>
+
+ <para>
To create an index on the expression <literal>lower(title)</literal>,
allowing efficient case-insensitive searches:
<programlisting>