From 2a0af7fe460eb46f9af996075972bf7c2e3f211d Mon Sep 17 00:00:00 2001
From: Tom Lane
Date: Thu, 25 Feb 2021 13:00:40 -0500
Subject: Allow complemented character class escapes within regex brackets.

The complement-class escapes \D, \S, \W are now allowed within
bracket expressions. There is no semantic difficulty with doing
that, but the rather hokey macro-expansion-based implementation
previously used here couldn't cope.

Also, invent "word" as an allowed character class name, thus "\w"
is now equivalent to "[[:word:]]" outside brackets, or "[:word:]"
within brackets. POSIX allows such implementation-specific
extensions, and the same name is used in e.g. bash.

One surprising compatibility issue this raises is that constructs
such as "[\w-_]" are now disallowed, as our documentation has always
said they should be: character classes can't be endpoints of a range.
Previously, because \w was just a macro for "[:alnum:]_", such a
construct was read as "[[:alnum:]_-_]", so it was accepted so long as
the character after "-" was numerically greater than or equal to "_".

Some implementation cleanup along the way:

* Remove the lexnest() hack, and in consequence clean up wordchrs()
to not interact with the lexer.

* Fix colorcomplement() to not be O(N^2) in the number of colors
involved.

* Get rid of useless-as-far-as-I-can-see calls of element()
on single-character character element names in brackpart().
element() always maps these to the character itself, and things
would be quite broken if it didn't --- should "[a]" match something
different than "a" does? Besides, the shortcut path in brackpart()
wasn't doing this anyway, making it even more inconsistent.

Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us
Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
---
 doc/src/sgml/func.sgml | 25 ++++++++++++-------------
 1 file changed, 12 insertions(+), 13 deletions(-)

(limited to 'doc/src')

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d8224272a57..860ae118264 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -6097,6 +6097,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 non-ASCII characters to belong to any of these classes.)
 In addition to these standard character classes,
 PostgreSQL defines
+ the word character class, which is the same as
+ alnum plus the underscore (_)
+ character, and
 the ascii character class, which contains exactly
 the 7-bit ASCII set.
@@ -6108,9 +6111,9 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 matching empty strings at the beginning and end of a word respectively.
 A word is defined as a sequence
 of word characters that is neither preceded nor followed by word
- characters. A word character is an alnum character (as
- defined by the POSIX character class described above)
- or an underscore. This is an extension, compatible with but not
+ characters. A word character is any character belonging to the
+ word character class, that is, any letter, digit,
+ or underscore. This is an extension, compatible with but not
 specified by POSIX 1003.2, and should be used with caution in software
 intended to be portable to other systems.
 The constraint escapes described below are usually preferable; they
@@ -6330,8 +6333,7 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 \w
- [[:alnum:]_]
- (note underscore is included)
+ [[:word:]]
@@ -6346,21 +6348,18 @@ SELECT foo FROM regexp_split_to_table('the quick brown fox', '\s*') AS foo;
 \W
- [^[:alnum:]_]
- (note underscore is included)
+ [^[:word:]]
- Within bracket expressions, \d, \s,
- and \w lose their outer brackets,
- and \D, \S, and \W are illegal.
- (So, for example, [a-c\d] is equivalent to
+ The class-shorthand escapes also work within bracket expressions,
+ although the definitions shown above are not quite syntactically
+ valid in that context.
+ For example, [a-c\d] is equivalent to
 [a-c[:digit:]].
- Also, [a-c\D], which is equivalent to
- [a-c^[:digit:]], is illegal.)
--
cgit v1.2.3
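
Illustrative note (not part of the patch): the compatibility issue described in the commit message can be modelled in a few lines. The sketch below is not PostgreSQL code. `old_expand` and `old_range_ok` are hypothetical helpers that only mimic, in Python, the textual macro expansion the message describes (\w replaced by "[:alnum:]_" inside brackets) and the range check that then applied to "_" and the character after "-".

```python
# Toy model of the pre-commit behaviour described in the commit message.
# These helpers are hypothetical; none of the names exist in PostgreSQL.

OLD_W_MACRO = "[:alnum:]_"  # per the message: \w was a macro for this text


def old_expand(bracket_body: str) -> str:
    # Textually substitute the macro for \w, then re-add the outer brackets.
    return "[" + bracket_body.replace(r"\w", OLD_W_MACRO) + "]"


def old_range_ok(end_char: str) -> bool:
    # After expansion, "\w-X" reads as the range "_-X", which is a valid
    # range only when X is numerically greater than or equal to "_".
    return ord(end_char) >= ord("_")


if __name__ == "__main__":
    print(old_expand(r"\w-_"))   # [[:alnum:]_-_]
    print(old_range_ok("_"))     # True
    print(old_range_ok("a"))     # True
    print(old_range_ok("Z"))     # False
```

Under this model, whether "[\w-X]" was accepted depended only on how X compared with "_". According to the commit message, the new behaviour rejects such constructs regardless of X, because a character class cannot be the endpoint of a range.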