author | Peter Eisentraut <peter@eisentraut.org> | 2019-03-22 12:09:32 +0100 |
---|---|---|
committer | Peter Eisentraut <peter@eisentraut.org> | 2019-03-22 12:12:43 +0100 |
commit | 5e1963fb764e9cc092e0f7b58b28985c311431d9 (patch) | |
tree | 544492f24e3d48d00bd2a19c11663f84f1e18ce4 /doc/src | |
parent | 2ab6d28d233af17987ea323e3235b2bda89b4f2e (diff) | |
Collations with nondeterministic comparison
This adds a flag "deterministic" to collations. If that is false,
such a collation disables various optimizations that assume that
strings are equal only if they are byte-wise equal. That then allows
use cases such as case-insensitive or accent-insensitive comparisons
or handling of strings with different Unicode normal forms.
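
To illustrate the distinction drawn above, here is a minimal sketch in Python. It is not the PostgreSQL or ICU implementation: `bytewise_equal` and `logically_equal` are hypothetical helper names, and NFC normalization from the standard `unicodedata` module is only a stand-in for the equivalence an ICU collation would apply.

```python
import unicodedata


def bytewise_equal(a: str, b: str) -> bool:
    """Deterministic-style equality: identical UTF-8 byte sequences."""
    return a.encode("utf-8") == b.encode("utf-8")


def logically_equal(a: str, b: str) -> bool:
    """Stand-in for a nondeterministic comparison.

    Assumption: NFC normalization approximates canonical equivalence.
    A real ICU collation applies its own comparison rules instead.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"  # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

print(bytewise_equal(precomposed, decomposed))   # False: the bytes differ
print(logically_equal(precomposed, decomposed))  # True: same character
```

Any optimization that treats the first function as the definition of equality gives wrong answers for a collation that behaves like the second.
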
This functionality is only supported with the ICU provider. At least
glibc doesn't appear to have any locales that work in a
nondeterministic way, so it's not worth supporting this for the libc
provider.
The term "deterministic comparison" in this context is from Unicode
Technical Standard #10
(https://unicode.org/reports/tr10/#Deterministic_Comparison).
This patch makes changes in three areas:
- CREATE COLLATION DDL changes and system catalog changes to support
this new flag.
- Many executor nodes and auxiliary code are extended to track
collations. Previously, this code would just throw away collation
information, because the eventually-called user-defined functions
didn't use it since they only cared about equality, which didn't
need collation information.
- String data type functions that do equality comparisons and hashing
are changed to take the (non-)deterministic flag into account. For
comparison, this just means skipping various shortcuts and tie
breakers that use byte-wise comparison. For hashing, we first need
to convert the input string to a canonical "sort key" using the ICU
analogue of strxfrm().
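
The hashing requirement in the last item can be sketched as follows. This is an illustration under stated assumptions, not the patch's code: `sort_key`, `hash_raw`, and `hash_with_key` are hypothetical names, and case folding plus NFC normalization stands in for the sort key that ICU would produce. The point is only that strings the collation considers equal must hash to the same value, which hashing the raw bytes does not guarantee.

```python
import hashlib
import unicodedata


def sort_key(s: str) -> bytes:
    """Hypothetical canonical key: case-folded, NFC-normalized UTF-8.

    Assumption: this approximates a case-insensitive ICU sort key;
    the real key comes from the collation provider.
    """
    return unicodedata.normalize("NFC", s.casefold()).encode("utf-8")


def hash_raw(s: str) -> str:
    """Hash the raw bytes, which is only valid for deterministic collations."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


def hash_with_key(s: str) -> str:
    """Hash the canonical key, so logically equal strings collide on purpose."""
    return hashlib.sha256(sort_key(s)).hexdigest()


a = "Caf\u00e9"     # precomposed "é", mixed case
b = "CAFE\u0301"    # upper case, "E" plus combining acute accent

print(hash_raw(a) == hash_raw(b))            # False: different bytes
print(hash_with_key(a) == hash_with_key(b))  # True: same canonical key
```

Computing the key costs extra work on every hash, which is one source of the performance penalty that the documentation changes below mention.
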
Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://www.postgresql.org/message-id/flat/1ccc668f-4cbc-0bef-af67-450b47cdfee7@2ndquadrant.com
Diffstat (limited to 'doc/src')
-rw-r--r-- | doc/src/sgml/catalogs.sgml | 7 |
-rw-r--r-- | doc/src/sgml/charset.sgml | 61 |
-rw-r--r-- | doc/src/sgml/citext.sgml | 21 |
-rw-r--r-- | doc/src/sgml/func.sgml | 6 |
-rw-r--r-- | doc/src/sgml/ref/create_collation.sgml | 22 |
5 files changed, 112 insertions, 5 deletions
diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 0fd792ff1a2..45ed077654e 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -2078,6 +2078,13 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l
 </row>
 
 <row>
+ <entry><structfield>collisdeterministic</structfield></entry>
+ <entry><type>bool</type></entry>
+ <entry></entry>
+ <entry>Is the collation deterministic?</entry>
+ </row>
+
+ <row>
 <entry><structfield>collencoding</structfield></entry>
 <entry><type>int4</type></entry>
 <entry></entry>
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index a6143ef8a74..555d1b4ac63 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -847,11 +847,13 @@ CREATE COLLATION german (provider = libc, locale = 'de_DE');
 
 <para>
 Note that while this system allows creating collations that <quote>ignore
- case</quote> or <quote>ignore accents</quote> or similar (using
- the <literal>ks</literal> key), PostgreSQL does not at the moment allow
- such collations to act in a truly case- or accent-insensitive manner. Any
- strings that compare equal according to the collation but are not
- byte-wise equal will be sorted according to their byte values.
+ case</quote> or <quote>ignore accents</quote> or similar (using the
+ <literal>ks</literal> key), in order for such collations to act in a
+ truly case- or accent-insensitive manner, they also need to be declared as not
+ <firstterm>deterministic</firstterm> in <command>CREATE COLLATION</command>;
+ see <xref linkend="collation-nondeterministic"/>.
+ Otherwise, any strings that compare equal according to the collation but
+ are not byte-wise equal will be sorted according to their byte values.
 </para>
 
 <note>
@@ -883,6 +885,55 @@ CREATE COLLATION french FROM "fr-x-icu";
 </para>
 </sect4>
 </sect3>
+
+ <sect3 id="collation-nondeterministic">
+ <title>Nondeterminstic Collations</title>
+
+ <para>
+ A collation is either <firstterm>deterministic</firstterm> or
+ <firstterm>nondeterministic</firstterm>. A deterministic collation uses
+ deterministic comparisons, which means that it considers strings to be
+ equal only if they consist of the same byte sequence. Nondeterministic
+ comparison may determine strings to be equal even if they consist of
+ different bytes. Typical situations include case-insensitive comparison,
+ accent-insensitive comparison, as well as comparion of strings in
+ different Unicode normal forms. It is up to the collation provider to
+ actually implement such insensitive comparisons; the deterministic flag
+ only determines whether ties are to be broken using bytewise comparison.
+ See also <ulink url="https://unicode.org/reports/tr10">Unicode Technical
+ Standard 10</ulink> for more information on the terminology.
+ </para>
+
+ <para>
+ To create a nondeterministic collation, specify the property
+ <literal>deterministic = false</literal> to <command>CREATE
+ COLLATION</command>, for example:
+<programlisting>
+CREATE COLLATION ndcoll (provider = icu, locale = 'und', deterministic = false);
+</programlisting>
+ This example would use the standard Unicode collation in a
+ nondeterministic way. In particular, this would allow strings in
+ different normal forms to be compared correctly. More interesting
+ examples make use of the ICU customization facilities explained above.
+ For example:
+<programlisting>
+CREATE COLLATION case_insensitive (provider = icu, locale = 'und-u-ks-level2', deterministic = false);
+CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-true', deterministic = false);
+</programlisting>
+ </para>
+
+ <para>
+ All standard and predefined collations are deterministic, all
+ user-defined collations are deterministic by default. While
+ nondeterministic collations give a more <quote>correct</quote> behavior,
+ especially when considering the full power of Unicode and its many
+ special cases, they also have some drawbacks. Foremost, their use leads
+ to a performance penalty. Also, certain operations are not possible with
+ nondeterministic collations, such as pattern matching operations.
+ Therefore, they should be used only in cases where they are specifically
+ wanted.
+ </para>
+ </sect3>
 </sect2>
 
 </sect1>
diff --git a/doc/src/sgml/citext.sgml b/doc/src/sgml/citext.sgml
index b1fe7101b20..85aa339d8ba 100644
--- a/doc/src/sgml/citext.sgml
+++ b/doc/src/sgml/citext.sgml
@@ -14,6 +14,16 @@
 exactly like <type>text</type>.
 </para>
 
+ <tip>
+ <para>
+ Consider using <firstterm>nondeterministic collations</firstterm> (see
+ <xref linkend="collation-nondeterministic"/>) instead of this module. They
+ can be used for case-insensitive comparisons, accent-insensitive
+ comparisons, and other combinations, and they handle more Unicode special
+ cases correctly.
+ </para>
+ </tip>
+
 <sect2>
 <title>Rationale</title>
 
@@ -246,6 +256,17 @@ SELECT * FROM users WHERE nick = 'Larry';
 will be invoked instead.
 </para>
 </listitem>
+
+ <listitem>
+ <para>
+ The approach of lower-casing strings for comparison does not handle some
+ Unicode special cases correctly, for example when one upper-case letter
+ has two lower-case letter equivalents. Unicode distinguishes between
+ <firstterm>case mapping</firstterm> and <firstterm>case
+ folding</firstterm> for this reason. Use nondeterministic collations
+ instead of <type>citext</type> to handle that correctly.
+ </para>
+ </listitem>
 </itemizedlist>
 </sect2>
 
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 3a99e209a2b..1a014732919 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4065,6 +4065,12 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
 </para>
 </caution>
 
+ <para>
+ The pattern matching operators of all three kinds do not support
+ nondeterministic collations. If required, apply a different collation to
+ the expression to work around this limitation.
+ </para>
+
 <sect2 id="functions-like">
 <title><function>LIKE</function></title>
 
diff --git a/doc/src/sgml/ref/create_collation.sgml b/doc/src/sgml/ref/create_collation.sgml
index 038797fce11..def4dda6e88 100644
--- a/doc/src/sgml/ref/create_collation.sgml
+++ b/doc/src/sgml/ref/create_collation.sgml
@@ -23,6 +23,7 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> (
 [ LC_COLLATE = <replaceable>lc_collate</replaceable>, ]
 [ LC_CTYPE = <replaceable>lc_ctype</replaceable>, ]
 [ PROVIDER = <replaceable>provider</replaceable>, ]
+ [ DETERMINISTIC = <replaceable>boolean</replaceable>, ]
 [ VERSION = <replaceable>version</replaceable> ]
 )
 CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replaceable>existing_collation</replaceable>
@@ -125,6 +126,27 @@ CREATE COLLATION [ IF NOT EXISTS ] <replaceable>name</replaceable> FROM <replace
 </varlistentry>
 
 <varlistentry>
+ <term><literal>DETERMINISTIC</literal></term>
+
+ <listitem>
+ <para>
+ Specifies whether the collation should use deterministic comparisons.
+ The default is true. A deterministic comparison considers strings that
+ are not byte-wise equal to be unequal even if they are considered
+ logically equal by the comparison. PostgreSQL breaks ties using a
+ byte-wise comparison. Comparison that is not deterministic can make the
+ collation be, say, case- or accent-insensitive. For that, you need to
+ choose an appropriate <literal>LC_COLLATE</literal> setting
+ <emphasis>and</emphasis> set the collation to not deterministic here.
+ </para>
+
+ <para>
+ Nondeterministic collations are only supported with the ICU provider.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
 <term><replaceable>version</replaceable></term>
 
 <listitem>