Allow complemented character class escapes within regex brackets.

The complement-class escapes \D, \S, \W are now allowed within
bracket expressions.  There is no semantic difficulty with doing
that, but the rather hokey macro-expansion-based implementation
previously used here couldn't cope.

Also, invent "word" as an allowed character class name, thus "\w"
is now equivalent to "[[:word:]]" outside brackets, or "[:word:]"
within brackets.  POSIX allows such implementation-specific
extensions, and the same name is used in e.g. bash.

One surprising compatibility issue this raises is that constructs
such as "[\w-_]" are now disallowed, as our documentation has always
said they should be: character classes can't be endpoints of a range.
Previously, because \w was just a macro for "[:alnum:]_", such a
construct was read as "[[:alnum:]_-_]", so it was accepted so long as
the character after "-" was numerically greater than or equal to "_".

Some implementation cleanup along the way:

* Remove the lexnest() hack, and in consequence clean up wordchrs()
to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors
involved.

* Get rid of useless-as-far-as-I-can-see calls of element()
on single-character character element names in brackpart().
element() always maps these to the character itself, and things
would be quite broken if it didn't --- should "[a]" match something
different than "a" does?  Besides, the shortcut path in brackpart()
wasn't doing this anyway, making it even more inconsistent.

Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us
Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
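
As a rough illustration of the two behaviors described in the message above, the following sketch uses Python's re module as a stand-in engine. This is an assumption made purely for illustration: it is not PostgreSQL's regex code, and it does not know the "[:word:]" class name. It does, however, treat a complemented escape inside brackets as a plain union with the other bracket members, and it also refuses a class escape as a range endpoint. The helper matching_chars and the sample string are hypothetical names introduced here.

```python
import re

# Stand-in engine: Python's re module, NOT PostgreSQL's regex code.
# SAMPLE is an arbitrary, hypothetical set of probe characters.
SAMPLE = "abc123_ -"


def matching_chars(pattern, sample=SAMPLE):
    """Return the characters of `sample` matched by a one-character pattern."""
    rx = re.compile(pattern)
    return "".join(ch for ch in sample if rx.fullmatch(ch))


# [a-c\D] is the union of the range a-c and "anything that is not a digit",
# so only the digits in SAMPLE are excluded.
union = matching_chars(r"[a-c\D]")

# A class escape cannot be the endpoint of a range; Python's re rejects
# this construct as well, raising re.error at compile time.
try:
    re.compile(r"[\w-_]")
    range_rejected = False
except re.error:
    range_rejected = True
```

In this stand-in engine, the bracket expression simply accumulates sets of characters, which is why a complemented class poses no semantic difficulty; the pre-patch restriction came from the macro-expansion implementation, not from the meaning of the construct.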
author: Tom Lane <tgl@sss.pgh.pa.us> 2021-02-25 13:00:40 -0500
committer: Tom Lane <tgl@sss.pgh.pa.us> 2021-02-25 13:00:40 -0500
commit: 2a0af7fe460eb46f9af996075972bf7c2e3f211d (patch)
tree: dc99ebbf913c05e67796401ebbd1cabe4fad349b /doc/src
parent: 6b40d9bdbdc9f873868b0ddecacd9a307fc8ee26 (diff)
1 files changed, 12 insertions, 13 deletions
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d8224272a57..860ae118264 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6097,6 +6097,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
     non-ASCII characters to belong to any of these classes.)
     In addition to these standard character
     classes, <productname>PostgreSQL</productname> defines
+    the <literal>word</literal> character class, which is the same as
+    <literal>alnum</literal> plus the underscore (<literal>_</literal>)
+    character, and
     the <literal>ascii</literal> character class, which contains exactly
     the 7-bit ASCII set.
    </para>
@@ -6108,9 +6111,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
     matching empty strings at the beginning
     and end of a word respectively.  A word is defined as a sequence
     of word characters that is neither preceded nor followed by word
-    characters.  A word character is an <literal>alnum</literal> character (as
-    defined by the <acronym>POSIX</acronym> character class described above)
-    or an underscore.  This is an extension, compatible with but not
+    characters.  A word character is any character belonging to the
+    <literal>word</literal> character class, that is, any letter, digit,
+    or underscore.  This is an extension, compatible with but not
     specified by <acronym>POSIX</acronym> 1003.2, and should be used with
     caution in software intended to be portable to other systems.
     The constraint escapes described below are usually preferable; they
@@ -6330,8 +6333,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 
        <row>
        <entry> <literal>\w</literal> </entry>
-       <entry> <literal>[[:alnum:]_]</literal>
-       (note underscore is included) </entry>
+       <entry> <literal>[[:word:]]</literal> </entry>
        </row>
 
        <row>
@@ -6346,21 +6348,18 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 
        <row>
        <entry> <literal>\W</literal> </entry>
-       <entry> <literal>[^[:alnum:]_]</literal>
-       (note underscore is included) </entry>
+       <entry> <literal>[^[:word:]]</literal> </entry>
        </row>
       </tbody>
      </tgroup>
     </table>
 
    <para>
-    Within bracket expressions, <literal>\d</literal>, <literal>\s</literal>,
-    and <literal>\w</literal> lose their outer brackets,
-    and <literal>\D</literal>, <literal>\S</literal>, and <literal>\W</literal> are illegal.
-    (So, for example, <literal>[a-c\d]</literal> is equivalent to
+    The class-shorthand escapes also work within bracket expressions,
+    although the definitions shown above are not quite syntactically
+    valid in that context.
+    For example, <literal>[a-c\d]</literal> is equivalent to
     <literal>[a-c[:digit:]]</literal>.
-    Also, <literal>[a-c\D]</literal>, which is equivalent to
-    <literal>[a-c^[:digit:]]</literal>, is illegal.)
    </para>
 
    <table id="posix-constraint-escapes-table">