aboutsummaryrefslogtreecommitdiff
path: root/doc/src
diff options
context:
space:
mode:
authorTom Lane <tgl@sss.pgh.pa.us>2019-05-14 11:27:31 -0400
committerTom Lane <tgl@sss.pgh.pa.us>2019-05-14 11:27:31 -0400
commit7c850320d8cfa5503ecec61c2559661b924f7595 (patch)
tree470c4e4f8068d391e102e553792e9351a980e28a /doc/src
parentfb489e4b3195fc33cccc0fd308e5a0ab502cf199 (diff)
downloadpostgresql-7c850320d8cfa5503ecec61c2559661b924f7595.tar.gz
postgresql-7c850320d8cfa5503ecec61c2559661b924f7595.zip
Fix SQL-style substring() to have spec-compliant greediness behavior.
SQL's regular-expression substring() function is defined to have a pattern argument that's separated into three subpatterns by escape- double-quote markers; the function result is the part of the input matching the second subpattern. The standard makes it clear that if there is ambiguity about how to match the input to the subpatterns, the first and third subpatterns should be taken to match the smallest possible amount of text (i.e., they're "non greedy", in the terms of our regex code). We were not doing it that way: the first subpattern would eat the largest possible amount of text, causing the function result to be shorter than what the spec requires. Fix that by attaching explicit greediness quantifiers to the subpatterns. (This depends on the regex fix in commit 8a29ed053; before that, this didn't reliably change the regex engine's behavior.) Also, by adding parentheses around each subpattern, we ensure that "|" (OR) in the subpatterns behave sanely. Previously, "|" in the first or third subpatterns didn't work. This patch also makes the function throw error if you write more than two escape-double-quote markers, and do something sane if you write just one, and document that behavior. Previously, an odd number of markers led to a confusing complaint about unbalanced parentheses, while extra pairs of markers were just ignored. (Note that the spec requires exactly two markers, but we've historically allowed there to be none, and this patch preserves the old behavior for that case.) In passing, adjust some substring() test cases that didn't really prove what they said they were testing for: they used patterns that didn't match the data string, so that the output would be NULL whether or not the function was really strict. Although this is certainly a bug fix, changing the behavior in back branches seems undesirable: applications could perhaps be depending on the old behavior, since it's not obviously wrong unless you read the spec very closely. Hence, no back-patch. Discussion: https://postgr.es/m/5bb27a41-350d-37bf-901e-9d26f5592dd0@charter.net
Diffstat (limited to 'doc/src')
-rw-r--r--doc/src/sgml/func.sgml42
1 files changed, 34 insertions, 8 deletions
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index da0f3059815..bc2275c8fee 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -4296,19 +4296,45 @@ cast(-44 as bit(12)) <lineannotation>111111010100</lineannotation>
</para>
<para>
- The <function>substring</function> function with three parameters,
- <function>substring(<replaceable>string</replaceable> from
- <replaceable>pattern</replaceable> for
- <replaceable>escape-character</replaceable>)</function>, provides
- extraction of a substring that matches an SQL
- regular expression pattern. As with <literal>SIMILAR TO</literal>, the
+ The <function>substring</function> function with three parameters
+ provides extraction of a substring that matches an SQL
+ regular expression pattern. The function can be written according
+ to SQL99 syntax:
+<synopsis>
+substring(<replaceable>string</replaceable> from <replaceable>pattern</replaceable> for <replaceable>escape-character</replaceable>)
+</synopsis>
+ or as a plain three-argument function:
+<synopsis>
+substring(<replaceable>string</replaceable>, <replaceable>pattern</replaceable>, <replaceable>escape-character</replaceable>)
+</synopsis>
+ As with <literal>SIMILAR TO</literal>, the
specified pattern must match the entire data string, or else the
function fails and returns null. To indicate the part of the
- pattern that should be returned on success, the pattern must contain
+ pattern for which the matching data sub-string is of interest,
+ the pattern should contain
two occurrences of the escape character followed by a double quote
(<literal>"</literal>). <!-- " font-lock sanity -->
The text matching the portion of the pattern
- between these markers is returned.
+ between these separators is returned when the match is successful.
+ </para>
+
+ <para>
+ The escape-double-quote separators actually
+ divide <function>substring</function>'s pattern into three independent
+ regular expressions; for example, a vertical bar (<literal>|</literal>)
+ in any of the three sections affects only that section. Also, the first
+ and third of these regular expressions are defined to match the smallest
+ possible amount of text, not the largest, when there is any ambiguity
+ about how much of the data string matches which pattern. (In POSIX
+ parlance, the first and third regular expressions are forced to be
+ non-greedy.)
+ </para>
+
+ <para>
+ As an extension to the SQL standard, <productname>PostgreSQL</productname>
+ allows there to be just one escape-double-quote separator, in which case
+ the third regular expression is taken as empty; or no separators, in which
+ case the first and third regular expressions are taken as empty.
</para>
<para>