Add SQL functions for Unicode normalization

This adds SQL expressions NORMALIZE() and IS NORMALIZED to convert and
check Unicode normal forms, per SQL standard.

To support fast IS NORMALIZED tests, we pull in a new data file
DerivedNormalizationProps.txt from Unicode and build a lookup table
from that, using techniques similar to ones already used for other
Unicode data.  make update-unicode will keep it up to date.  We only
build and use these tables for the NFC and NFKC forms, because they
are too big for NFD and NFKD and the improvement is not significant
enough there.

Reviewed-by: Daniel Verite <daniel@manitou-mail.org>
Reviewed-by: Andreas Karlsson <andreas@proxel.se>
Discussion: https://www.postgresql.org/message-id/flat/c1909f27-c269-2ed9-12f8-3ab72c8caf7a@2ndquadrant.com
author: Peter Eisentraut <peter@eisentraut.org> 2020-03-26 08:14:00 +0100
committer: Peter Eisentraut <peter@eisentraut.org> 2020-04-02 08:56:27 +0200
commit: 2991ac5fc9b3904ca4582be6d323497d7c3d17c9 (patch)
tree: d558847de39ee972b261026d4846f1f31e8dff12 /doc/src
parent: 070c3d3937e75e04d36405287353b7eca516555d (diff)
download: postgresql-2991ac5fc9b3904ca4582be6d323497d7c3d17c9.tar.gz
postgresql-2991ac5fc9b3904ca4582be6d323497d7c3d17c9.zip
2 files changed, 58 insertions, 0 deletions
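
The lookup-table idea in the commit message can be illustrated with a small Python sketch of a quick check in the style of Unicode Standard Annex #15. This is not the code added by this commit. The names `QC`, `NFC_QC_EXCERPT`, `nfc_quick_check` and `is_nfc_normalized` are hypothetical, and the table is a hand-picked three-entry excerpt of the NFC_Quick_Check property, not something generated from DerivedNormalizationProps.txt. Because the excerpt is incomplete, the sketch gives correct answers only for strings whose special code points it lists.

```python
import unicodedata
from enum import Enum


class QC(Enum):
    YES = "yes"
    NO = "no"
    MAYBE = "maybe"


# Hypothetical, hand-picked excerpt of the NFC_Quick_Check property.
# A real table would be generated from DerivedNormalizationProps.txt;
# this sketch treats every code point that is not listed as YES.
NFC_QC_EXCERPT = {
    0x0308: QC.MAYBE,  # COMBINING DIAERESIS may compose with a preceding base
    0x0340: QC.NO,     # COMBINING GRAVE TONE MARK always changes under NFC
    0x212B: QC.NO,     # ANGSTROM SIGN always changes under NFC
}


def nfc_quick_check(s: str) -> QC:
    """Classify s as YES, NO or MAYBE using the toy table above."""
    result = QC.YES
    last_ccc = 0
    for ch in s:
        ccc = unicodedata.combining(ch)
        # Combining marks out of canonical order can never be normalized.
        if ccc != 0 and last_ccc > ccc:
            return QC.NO
        prop = NFC_QC_EXCERPT.get(ord(ch), QC.YES)
        if prop is QC.NO:
            return QC.NO
        if prop is QC.MAYBE:
            result = QC.MAYBE
        last_ccc = ccc
    return result


def is_nfc_normalized(s: str) -> bool:
    """Answer from the table when possible, else normalize and compare."""
    verdict = nfc_quick_check(s)
    if verdict is QC.MAYBE:
        return unicodedata.normalize("NFC", s) == s
    return verdict is QC.YES
```

Only the MAYBE case pays for a full normalization, which is the kind of saving a quick-check table is meant to provide.
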
diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 20cdfabd7bf..b6023fa459e 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -934,6 +934,16 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
      such as pattern matching operations.  Therefore, they should be used
      only in cases where they are specifically wanted.
     </para>
+
+    <tip>
+     <para>
+      To deal with text in different Unicode normalization forms, it is also
+      an option to use the functions/expressions
+      <function>normalize</function> and <literal>is normalized</literal> to
+      preprocess or check the strings, instead of using nondeterministic
+      collations.  There are different trade-offs for each approach.
+     </para>
+    </tip>
    </sect3>
   </sect2>
  </sect1>
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index cbfd2a762e4..a329f61f339 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -1563,6 +1563,30 @@
       <row>
        <entry>
         <indexterm>
+         <primary>normalized</primary>
+        </indexterm>
+        <indexterm>
+         <primary>Unicode normalization</primary>
+        </indexterm>
+        <literal><parameter>string</parameter> is <optional>not</optional> <optional><parameter>form</parameter></optional> normalized</literal>
+       </entry>
+       <entry><type>boolean</type></entry>
+       <entry>
+        Checks whether the string is in the specified Unicode normalization
+        form.  The optional parameter specifies the form:
+        <literal>NFC</literal> (default), <literal>NFD</literal>,
+        <literal>NFKC</literal>, <literal>NFKD</literal>.  This expression can
+        only be used if the server encoding is <literal>UTF8</literal>.  Note
+        that checking for normalization using this expression is often faster
+        than normalizing possibly already normalized strings.
+       </entry>
+       <entry><literal>U&amp;'\0061\0308bc' IS NFD NORMALIZED</literal></entry>
+       <entry><literal>true</literal></entry>
+      </row>
+
+      <row>
+       <entry>
+        <indexterm>
          <primary>bit_length</primary>
         </indexterm>
         <literal><function>bit_length(<parameter>string</parameter>)</function></literal>
@@ -1613,6 +1637,30 @@
       <row>
        <entry>
         <indexterm>
+         <primary>normalize</primary>
+        </indexterm>
+        <indexterm>
+         <primary>Unicode normalization</primary>
+        </indexterm>
+        <literal><function>normalize(<parameter>string</parameter> <type>text</type>
+        <optional>, <parameter>form</parameter> </optional>)</function></literal>
+       </entry>
+       <entry><type>text</type></entry>
+       <entry>
+        Converts the string in the first argument to the specified Unicode
+        normalization form.  The optional second argument specifies the form
+        as an identifier: <literal>NFC</literal> (default),
+        <literal>NFD</literal>, <literal>NFKC</literal>,
+        <literal>NFKD</literal>.  This function can only be used if the server
+        encoding is <literal>UTF8</literal>.
+       </entry>
+       <entry><literal>normalize(U&amp;'\0061\0308bc', NFC)</literal></entry>
+       <entry><literal>U&amp;'\00E4bc'</literal></entry>
+      </row>
+
+      <row>
+       <entry>
+        <indexterm>
          <primary>octet_length</primary>
         </indexterm>
         <literal><function>octet_length(<parameter>string</parameter>)</function></literal>
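
The two documentation examples in this diff can be cross-checked outside the server with a short Python sketch. It assumes Python 3.8 or later (for `unicodedata.is_normalized`) and relies on Python's own Unicode tables rather than the ones PostgreSQL builds, so it is only a rough counterpart of the SQL expressions and not a test of the patch itself.

```python
import unicodedata

# "a" followed by U+0308 COMBINING DIAERESIS, then "bc": the decomposed
# spelling used in both documentation examples.
decomposed = "a\u0308bc"

# Rough counterpart of normalize(U&'\0061\0308bc', NFC).
composed = unicodedata.normalize("NFC", decomposed)
print(composed == "\u00e4bc")  # prints True

# Rough counterpart of U&'\0061\0308bc' IS NFD NORMALIZED.
print(unicodedata.is_normalized("NFD", decomposed))  # prints True

# The same string is not in NFC, the default form of both expressions.
print(unicodedata.is_normalized("NFC", decomposed))  # prints False
```
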