author | Tom Lane <tgl@sss.pgh.pa.us> | 2021-03-02 11:34:53 -0500 |
---|---|---|
committer | Tom Lane <tgl@sss.pgh.pa.us> | 2021-03-02 11:34:53 -0500 |
commit | 4aea704a5bfd4b5894a268499369ccab89940c9c (patch) | |
tree | 26b6918c1e79027b088d02f1e2d560c050931ee4 /src/test/modules/test_regex/sql/test_regex.sql | |
parent | c5530d8474a482d32c0d9bb099707d9a8e117f96 (diff) | |
Fix semantics of regular expression back-references.

POSIX defines the behavior of back-references thus:

    The back-reference expression '\n' shall match the same (possibly
    empty) string of characters as was matched by a subexpression
    enclosed between "\(" and "\)" preceding the '\n'.

As far as I can see, the back-reference is supposed to consider only
the data characters matched by the referenced subexpression. However,
because our engine copies the NFA constructed from the referenced
subexpression, it effectively enforces any constraints therein, too.
As an example, '(^.)\1' ought to match 'xx', or any other string
starting with two occurrences of the same character; but in our code
it does not, and indeed can't match anything, because the '^' anchor
constraint is included in the backref's copied NFA. If POSIX intended
that, you'd think they'd mention it. Perl for one doesn't act that
way, so it's hard to conclude that this isn't a bug.
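For comparison, here is a small sketch of the text-only reading of a back-reference, using Python's `re` module as a stand-in for a Perl-style engine. This is an assumption made for illustration: it is not PostgreSQL's engine, and the variable names are hypothetical.

```python
import re

# A Perl-style engine compares the back-reference against the captured
# text only, so the '^' inside the group is not re-applied at \1.
m = re.search(r"(^.)\1", "xx")
matched = m.group(0) if m else None

# The data characters still have to agree: 'xy' does not start with
# two occurrences of the same character.
unmatched = re.search(r"(^.)\1", "xy")

print(matched, unmatched)
```

Under this reading the pattern matches any string that starts with a doubled character, which is the behavior the commit message argues for.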
Fix by modifying the backref's NFA immediately after it's copied from
the reference, replacing all constraint arcs by EMPTY arcs so that the
constraints are treated as automatically satisfied. This still allows
us to enforce matching rules that depend only on the data characters;
for example, in '(^\d+).*\1' the NFA matching step will still know
that the backref can only match strings of digits.
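The idea of neutralizing constraints in the copied NFA can be sketched with a toy arc list. Everything here (`ArcKind`, `Arc`, `strip_constraints`) is hypothetical and much simpler than the real C data structures in the regex engine; it only illustrates turning constraint arcs into EMPTY arcs while leaving character arcs alone.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto


class ArcKind(Enum):
    PLAIN = auto()   # consumes one data character matching `label`
    EMPTY = auto()   # consumes nothing and can always be traversed
    BOL = auto()     # '^' constraint (hypothetical name)
    EOL = auto()     # '$' constraint (hypothetical name)
    AHEAD = auto()   # lookahead constraint (hypothetical name)


# Arc kinds that test context rather than consume data.
CONSTRAINT_KINDS = {ArcKind.BOL, ArcKind.EOL, ArcKind.AHEAD}


@dataclass(frozen=True)
class Arc:
    kind: ArcKind
    src: int
    dst: int
    label: str = ""


def strip_constraints(arcs):
    """Return a copy of `arcs` with every constraint arc made EMPTY."""
    return [
        replace(a, kind=ArcKind.EMPTY, label="")
        if a.kind in CONSTRAINT_KINDS
        else a
        for a in arcs
    ]


# Toy NFA for '^\d+': 0 --BOL--> 1 --digit--> 2 --digit--> 2
referenced = [
    Arc(ArcKind.BOL, 0, 1),
    Arc(ArcKind.PLAIN, 1, 2, "digit"),
    Arc(ArcKind.PLAIN, 2, 2, "digit"),
]
backref_copy = strip_constraints(referenced)
```

In the copy, the `^` arc becomes an EMPTY arc, so the anchor is treated as satisfied. The digit arcs are unchanged, so the copy still accepts only strings of digits, in line with the '(^\d+).*\1' example.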
Perhaps surprisingly, this change does not affect the results of any
of a rather large corpus of real-world regexes. Nonetheless, I would
not consider back-patching it, since it's a clear compatibility break.

Patch by me, reviewed by Joel Jacobson

Discussion: https://postgr.es/m/661609.1614560029@sss.pgh.pa.us

Diffstat (limited to 'src/test/modules/test_regex/sql/test_regex.sql')
-rw-r--r-- | src/test/modules/test_regex/sql/test_regex.sql | 5 |
1 files changed, 5 insertions, 0 deletions
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index b99329391e8..7f5bc6e418f 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -770,6 +770,11 @@ select * from test_regex('^(.+)( \1)+$', 'abc abd abc', 'RP');
 -- expectNomatch 14.29 RP {^(.+)( \1)+$} {abc abc abd}
 select * from test_regex('^(.+)( \1)+$', 'abc abc abd', 'RP');
 
+-- back reference only matches the string, not any constraints
+select * from test_regex('(^\w+).*\1', 'abc abc abc', 'LRP');
+select * from test_regex('(^\w+\M).*\1', 'abc abcd abd', 'LRP');
+select * from test_regex('(\w+(?= )).*\1', 'abc abcd abd', 'HLRP');
+
 -- doing 15 "octal escapes vs back references"
 
 -- # initial zero is always octal
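As a rough cross-check of what the three added tests exercise, the same patterns can be tried in an engine that already treats back-references as text-only. The sketch below uses Python's `re`, which is an assumption made for illustration and not the `test_regex` harness. Python has no `\M` escape, so `\b` stands in for the end-of-word constraint, and the flag strings ('LRP', 'HLRP') have no counterpart here.

```python
import re

# (pattern, subject) pairs adapted from the added regression tests.
cases = [
    (r"(^\w+).*\1", "abc abc abc"),
    (r"(^\w+\b).*\1", "abc abcd abd"),    # \b substitutes for \M
    (r"(\w+(?= )).*\1", "abc abcd abd"),
]

results = []
for pattern, subject in cases:
    m = re.search(pattern, subject)
    # Record the whole match and the captured group, or None on failure.
    results.append((m.group(0), m.group(1)) if m else None)

for (pattern, subject), result in zip(cases, results):
    print(pattern, repr(subject), result)
```

In the second and third cases the back-reference lands on the `abc` inside `abcd`, where neither the word-end constraint nor the lookahead for a space would hold. A match there is only possible if the constraints are not re-applied at the back-reference.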