aboutsummaryrefslogtreecommitdiff
path: root/doc/src
diff options
context:
space:
mode:
Diffstat (limited to 'doc/src')
-rw-r--r--doc/src/sgml/catalogs.sgml7
-rw-r--r--doc/src/sgml/charset.sgml61
-rw-r--r--doc/src/sgml/citext.sgml21
-rw-r--r--doc/src/sgml/func.sgml6
-rw-r--r--doc/src/sgml/ref/create_collation.sgml22
5 files changed, 112 insertions, 5 deletions
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 0fd792ff1a2..45ed077654e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2078,6 +2078,13 @@ SCRAM-SHA-256$<replaceable>&lt;iteration count&gt;</replaceable>:<replaceable>&l
</row>
<row>
+ <entry><structfield>collisdeterministic</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>Is the collation deterministic?</entry>
+ </row>
+
+ <row>
<entry><structfield>collencoding</structfield></entry>
<entry><type>int4</type></entry>
<entry></entry>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index a6143ef8a74..555d1b4ac63 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -847,11 +847,13 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
<para>
Note that while this system allows creating collations that <quote>ignore
- case</quote> or <quote>ignore accents</quote> or similar (using
- the <literal>ks</literal> key), PostgreSQL does not at the moment allow
- such collations to act in a truly case- or accent-insensitive manner. Any
- strings that compare equal according to the collation but are not
- byte-wise equal will be sorted according to their byte values.
+ case</quote> or <quote>ignore accents</quote> or similar (using the
+ <literal>ks</literal> key), in order for such collations to act in a
+ truly case- or accent-insensitive manner, they also need to be declared as not
+ <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
+ see <xref linkend="collation-nondeterministic"/>.
+ Otherwise, any strings that compare equal according to the collation but
+ are not byte-wise equal will be sorted according to their byte values.
</para>
<note>
@@ -883,6 +885,55 @@ CREATE COLLATION french FROM "fr-x-icu";
</para>
</sect4>
</sect3>
+
+ <sect3 id="collation-nondeterministic">
+ <title>Nondeterminstic Collations</title>
+
+ <para>
+ A collation is either <firstterm>deterministic</firstterm> or
+ <firstterm>nondeterministic</firstterm>. A deterministic collation uses
+ deterministic comparisons, which means that it considers strings to be
+ equal only if they consist of the same byte sequence. Nondeterministic
+ comparison may determine strings to be equal even if they consist of
+ different bytes. Typical situations include case-insensitive comparison,
+ accent-insensitive comparison, as well as comparion of strings in
+ different Unicode normal forms. It is up to the collation provider to
+ actually implement such insensitive comparisons; the deterministic flag
+ only determines whether ties are to be broken using bytewise comparison.
+ See also <ulink url="https://unicode.org/reports/tr10">Unicode Technical
+ Standard 10</ulink> for more information on the terminology.
+ </para>
+
+ <para>
+ To create a nondeterministic collation, specify the property
+ <literal>deterministic = false</literal> to <command>CREATE
+ COLLATION</command>, for example:
+<programlisting>
+CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
+</programlisting>
+ This example would use the standard Unicode collation in a
+ nondeterministic way. In particular, this would allow strings in
+ different normal forms to be compared correctly. More interesting
+ examples make use of the ICU customization facilities explained above.
+ For example:
+<programlisting>
+CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
+CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
+</programlisting>
+ </para>
+
+ <para>
+ All standard and predefined collations are deterministic, all
+ user-defined collations are deterministic by default. While
+ nondeterministic collations give a more <quote>correct</quote> behavior,
+ especially when considering the full power of Unicode and its many
+ special cases, they also have some drawbacks. Foremost, their use leads
+ to a performance penalty. Also, certain operations are not possible with
+ nondeterministic collations, such as pattern matching operations.
+ Therefore, they should be used only in cases where they are specifically
+ wanted.
+ </para>
+ </sect3>
</sect2>
</sect1>
diff --git a/doc/src/sgml/citext.sgml b/doc/src/sgml/citext.sgml
index b1fe7101b20..85aa339d8ba 100644
--- a/doc/src/sgml/citext.sgml
+++ b/doc/src/sgml/citext.sgml
@@ -14,6 +14,16 @@
exactly like <type>text</type>.
</para>
+ <tip>
+ <para>
+ Consider using <firstterm>nondeterministic collations</firstterm> (see
+ <xref linkend="collation-nondeterministic"/>) instead of this module. They
+ can be used for case-insensitive comparisons, accent-insensitive
+ comparisons, and other combinations, and they handle more Unicode special
+ cases correctly.
+ </para>
+ </tip>
+
<sect2>
<title>Rationale</title>
@@ -246,6 +256,17 @@ SELECT * FROM users WHERE nick = 'Larry';
will be invoked instead.
</para>
</listitem>
+
+ <listitem>
+ <para>
+ The approach of lower-casing strings for comparison does not handle some
+ Unicode special cases correctly, for example when one upper-case letter
+ has two lower-case letter equivalents. Unicode distinguishes between
+ <firstterm>case mapping</firstterm> and <firstterm>case
+ folding</firstterm> for this reason. Use nondeterministic collations
+ instead of <type>citext</type> to handle that correctly.
+ </para>
+ </listitem>
</itemizedlist>
</sect2>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3a99e209a2b..1a014732919 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4065,6 +4065,12 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
</para>
</caution>
+ <para>
+ The pattern matching operators of all three kinds do not support
+ nondeterministic collations. If required, apply a different collation to
+ the expression to work around this limitation.
+ </para>
+
<sect2 id="functions-like">
<title><function>LIKE</function></title>
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 038797fce11..def4dda6e88 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -23,6 +23,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> (
[ LC_COLLATE = <replaceable>lc_collate</replaceable>, ]
[ LC_CTYPE = <replaceable>lc_ctype</replaceable>, ]
[ PROVIDER = <replaceable>provider</replaceable>, ]
+ [ DETERMINISTIC = <replaceable>boolean</replaceable>, ]
[ VERSION = <replaceable>version</replaceable> ]
)
CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replaceable>existing_collation</replaceable>
@@ -125,6 +126,27 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
</varlistentry>
<varlistentry>
+ <term><literal>DETERMINISTIC</literal></term>
+
+ <listitem>
+ <para>
+ Specifies whether the collation should use deterministic comparisons.
+ The default is true. A deterministic comparison considers strings that
+ are not byte-wise equal to be unequal even if they are considered
+ logically equal by the comparison. PostgreSQL breaks ties using a
+ byte-wise comparison. Comparison that is not deterministic can make the
+ collation be, say, case- or accent-insensitive. For that, you need to
+ choose an appropriate <literal>LC_COLLATE</literal> setting
+ <emphasis>and</emphasis> set the collation to not deterministic here.
+ </para>
+
+ <para>
+ Nondeterministic collations are only supported with the ICU provider.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
<term><replaceable>version</replaceable></term>
<listitem>