Diffstat (limited to 'doc/src')
-rw-r--r-- | doc/src/sgml/catalogs.sgml | 7
-rw-r--r-- | doc/src/sgml/charset.sgml | 61
-rw-r--r-- | doc/src/sgml/citext.sgml | 21
-rw-r--r-- | doc/src/sgml/func.sgml | 6
-rw-r--r-- | doc/src/sgml/ref/create_collation.sgml | 22
5 files changed, 112 insertions, 5 deletions
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 0fd792ff1a2..45ed077654e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2078,6 +2078,13 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
  </row>

  <row>
+ <entry><structfield>collisdeterministic</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>Is the collation deterministic?</entry>
+ </row>
+
+ <row>
  <entry><structfield>collencoding</structfield></entry>
  <entry><type>int4</type></entry>
  <entry></entry>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index a6143ef8a74..555d1b4ac63 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -847,11 +847,13 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');

  <para>
  Note that while this system allows creating collations that <quote>ignore
- case</quote> or <quote>ignore accents</quote> or similar (using
- the <literal>ks</literal> key), PostgreSQL does not at the moment allow
- such collations to act in a truly case- or accent-insensitive manner. Any
- strings that compare equal according to the collation but are not
- byte-wise equal will be sorted according to their byte values.
+ case</quote> or <quote>ignore accents</quote> or similar (using the
+ <literal>ks</literal> key), in order for such collations to act in a
+ truly case- or accent-insensitive manner, they also need to be declared as not
+ <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
+ see <xref linkend="collation-nondeterministic"/>.
+ Otherwise, any strings that compare equal according to the collation but
+ are not byte-wise equal will be sorted according to their byte values.
  </para>

  <note>
@@ -883,6 +885,55 @@ CREATE COLLATION french FROM "fr-x-icu";
  </para>
  </sect4>
  </sect3>
+
+ <sect3 id="collation-nondeterministic">
+ <title>Nondeterminstic Collations</title>
+
+ <para>
+ A collation is either <firstterm>deterministic</firstterm> or
+ <firstterm>nondeterministic</firstterm>. A deterministic collation uses
+ deterministic comparisons, which means that it considers strings to be
+ equal only if they consist of the same byte sequence. Nondeterministic
+ comparison may determine strings to be equal even if they consist of
+ different bytes. Typical situations include case-insensitive comparison,
+ accent-insensitive comparison, as well as comparion of strings in
+ different Unicode normal forms. It is up to the collation provider to
+ actually implement such insensitive comparisons; the deterministic flag
+ only determines whether ties are to be broken using bytewise comparison.
+ See also <ulink url="https://unicode.org/reports/tr10">Unicode Technical
+ Standard 10</ulink> for more information on the terminology.
+ </para>
+
+ <para>
+ To create a nondeterministic collation, specify the property
+ <literal>deterministic = false</literal> to <command>CREATE
+ COLLATION</command>, for example:
+<programlisting>
+CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
+</programlisting>
+ This example would use the standard Unicode collation in a
+ nondeterministic way. In particular, this would allow strings in
+ different normal forms to be compared correctly. More interesting
+ examples make use of the ICU customization facilities explained above.
+ For example:
+<programlisting>
+CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
+CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
+</programlisting>
+ </para>
+
+ <para>
+ All standard and predefined collations are deterministic, all
+ user-defined collations are deterministic by default. While
+ nondeterministic collations give a more <quote>correct</quote> behavior,
+ especially when considering the full power of Unicode and its many
+ special cases, they also have some drawbacks. Foremost, their use leads
+ to a performance penalty. Also, certain operations are not possible with
+ nondeterministic collations, such as pattern matching operations.
+ Therefore, they should be used only in cases where they are specifically
+ wanted.
+ </para>
+ </sect3>
  </sect2>
  </sect1>

diff --git a/doc/src/sgml/citext.sgml b/doc/src/sgml/citext.sgml
index b1fe7101b20..85aa339d8ba 100644
--- a/doc/src/sgml/citext.sgml
+++ b/doc/src/sgml/citext.sgml
@@ -14,6 +14,16 @@
  exactly like <type>text</type>.
  </para>

+ <tip>
+ <para>
+ Consider using <firstterm>nondeterministic collations</firstterm> (see
+ <xref linkend="collation-nondeterministic"/>) instead of this module. They
+ can be used for case-insensitive comparisons, accent-insensitive
+ comparisons, and other combinations, and they handle more Unicode special
+ cases correctly.
+ </para>
+ </tip>
+
  <sect2>
  <title>Rationale</title>

@@ -246,6 +256,17 @@ SELECT * FROM users WHERE nick = 'Larry';
  will be invoked instead.
  </para>
  </listitem>
+
+ <listitem>
+ <para>
+ The approach of lower-casing strings for comparison does not handle some
+ Unicode special cases correctly, for example when one upper-case letter
+ has two lower-case letter equivalents. Unicode distinguishes between
+ <firstterm>case mapping</firstterm> and <firstterm>case
+ folding</firstterm> for this reason. Use nondeterministic collations
+ instead of <type>citext</type> to handle that correctly.
+ </para>
+ </listitem>
  </itemizedlist>
  </sect2>

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3a99e209a2b..1a014732919 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4065,6 +4065,12 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
  </para>
  </caution>

+ <para>
+ The pattern matching operators of all three kinds do not support
+ nondeterministic collations. If required, apply a different collation to
+ the expression to work around this limitation.
+ </para>
+
  <sect2 id="functions-like">
  <title><function>LIKE</function></title>

diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 038797fce11..def4dda6e88 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -23,6 +23,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> (
  [ LC_COLLATE = <replaceable>lc_collate</replaceable>, ]
  [ LC_CTYPE = <replaceable>lc_ctype</replaceable>, ]
  [ PROVIDER = <replaceable>provider</replaceable>, ]
+ [ DETERMINISTIC = <replaceable>boolean</replaceable>, ]
  [ VERSION = <replaceable>version</replaceable> ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replaceable>existing_collation</replaceable>
@@ -125,6 +126,27 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
  </varlistentry>

  <varlistentry>
+ <term><literal>DETERMINISTIC</literal></term>
+
+ <listitem>
+ <para>
+ Specifies whether the collation should use deterministic comparisons.
+ The default is true. A deterministic comparison considers strings that
+ are not byte-wise equal to be unequal even if they are considered
+ logically equal by the comparison. PostgreSQL breaks ties using a
+ byte-wise comparison. Comparison that is not deterministic can make the
+ collation be, say, case- or accent-insensitive. For that, you need to
+ choose an appropriate <literal>LC_COLLATE</literal> setting
+ <emphasis>and</emphasis> set the collation to not deterministic here.
+ </para>
+
+ <para>
+ Nondeterministic collations are only supported with the ICU provider.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
  <term><replaceable>version</replaceable></term>

  <listitem>