aboutsummaryrefslogtreecommitdiff
path: root/src/backend/regex/regexec.c
Commit message (Collapse)AuthorAge
* Fix regex match failures for backrefs combined with non-greedy quantifiers.Tom Lane2013-07-18
| | | | | | | | | | | | An ancient logic error in cfindloop() could cause the regex engine to fail to find matches that begin later than the start of the string. This function is only used when the regex pattern contains a back reference, and so far as we can tell the error is only reachable if the pattern is non-greedy (i.e. its first quantifier uses the ? modifier). Furthermore, the actual match must begin after some potential match that satisfies the DFA but then fails the back-reference's match test. Reported and fixed by Jeevan Chalke, with cosmetic adjustments by me.
* Run pgindent on 9.2 source tree in preparation for first 9.3Bruce Momjian2012-06-10
| | | | commit-fest.
* Fix array overrun in regex code.Tom Lane2012-05-24
| | | | | | | | | | | | | | | | | | | zaptreesubs() was coded to unconditionally reset a capture subre's corresponding pmatch[] entry. However, in regexes without backrefs, that array is caller-supplied and might not have as many entries as the regex has capturing parens. So check the array length and do nothing if there is no corresponding entry, much as subset() does. Failure to check this resulted in a stack clobber in the case reported by Marko Kreen. This bug appears to have been latent in the regex library from the beginning. It was not exposed because find() called dissect() not cdissect(), and the dissect() code path didn't ever call zaptreesubs() (formerly zapmem()). When I unified dissect() and cdissect() in commit 4dd78bf37aa29d04b3f358b08c4a2fa43cf828e7, the problem was exposed. Now that I've seen this, I'm rather suspicious that we might need to back-patch it; but will refrain for now, for lack of evidence that the case can be hit in the previous coding.
* Merge dissect() into cdissect() to remove a pile of near-duplicate code.Tom Lane2012-02-24
| | | | | | | | | The "uncomplicated" case isn't materially less complicated than the full case, certainly not enough so to justify duplicating nearly 500 lines of code. The only extra work being done in the full path is zaptreesubs, which is very cheap compared to everything else being done here, and besides that I'm less than convinced that it's not needed in some cases even without backrefs.
* Avoid repeated creation/freeing of per-subre DFAs during regex search.Tom Lane2012-02-24
| | | | | | | | | | | | In nested sub-regex trees, lower-level nodes created DFAs and then destroyed them again before exiting, which is a bit dumb considering that the recursive search is likely to call those nodes again later. Instead cache each created DFA until the end of pg_regexec(). This is basically a space for time tradeoff, in that it might increase the maximum memory usage. However, in most regex patterns there are not all that many subre nodes, so not that many DFAs --- and in any case, the peak usage occurs when reaching the bottom recursion level, and except for alternation cases that's going to be the same anyway.
* Remove useless "retry memory" logic within regex engine.Tom Lane2012-02-24
| | | | | | | | | | | | | | | | | | Apparently some primordial version of Spencer's engine needed cdissect() and child functions to be able to continue matching from a previous position when re-called. That is dead code, though, since trivial inspection shows that cdissect can never be entered without having previously done zapmem which resets the relevant retry counter. I have also verified experimentally that no case in the Tcl regression tests reaches cdissect with a nonzero retry value. Accordingly, remove that logic. This doesn't really save any noticeable number of cycles in itself, but it is one step towards making dissect() and cdissect() equivalent, which will allow removing hundreds of lines of near-duplicated code. Since struct subre's "retry" field is no longer particularly related to any kind of retry, rename it to "id". As of this commit it's only used for identifying a subre node in debug printouts, so you might think we should get rid of the field entirely; but I have a plan for another use.
* Fix the general case of quantified regex back-references.Tom Lane2012-02-24
| | | | | | | | | | | | | | Cases where a back-reference is part of a larger subexpression that is quantified have never worked in Spencer's regex engine, because he used a compile-time transformation that neglected the need to check the back-reference match in iterations before the last one. (That was okay for capturing parens, and we still do it if the regex has *only* capturing parens ... but it's not okay for backrefs.) To make this work properly, we have to add an "iteration" node type to the regex engine's vocabulary of sub-regex nodes. Since this is a moderately large change with a fair risk of introducing new bugs of its own, apply to HEAD only, even though it's a fix for a longstanding bug.
* Fix regex back-references that are directly quantified with *.Tom Lane2012-02-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The syntax "\n*", that is a backref with a * quantifier directly applied to it, has never worked correctly in Spencer's library. This has been an open bug in the Tcl bug tracker since 2005: https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894 The core of the problem is in parseqatom(), which first changes "\n*" to "\n+|" and then applies repeat() to the NFA representing the backref atom. repeat() thinks that any arc leading into its "rp" argument is part of the sub-NFA to be repeated. Unfortunately, since parseqatom() already created the arc that was intended to represent the empty bypass around "\n+", this arc gets moved too, so that it now leads into the state loop created by repeat(). Thus, what was supposed to be an "empty" bypass gets turned into something that represents zero or more repetitions of the NFA representing the backref atom. In the original example, in place of ^([bc])\1*$ we now have something that acts like ^([bc])(\1+|[bc]*)$ At runtime, the branch involving the actual backref fails, as it's supposed to, but then the other branch succeeds anyway. We could no doubt fix this by some rearrangement of the operations in parseqatom(), but that code is plenty ugly already, and what's more the whole business of converting "x*" to "x+|" probably needs to go away to fix another problem I'll mention in a moment. Instead, this patch suppresses the *-conversion when the target is a simple backref atom, leaving the case of m == 0 to be handled at runtime. This makes the patch in regcomp.c a one-liner, at the cost of having to tweak cbrdissect() a little. In the event I went a bit further than that and rewrote cbrdissect() to check all the string-length-related conditions before it starts comparing characters. It seems a bit stupid to possibly iterate through many copies of an n-character backreference, only to fail at the end because the target string's length isn't a multiple of n --- we could have found that out before starting. The existing coding could only be a win if integer division is hugely expensive compared to character comparison, but I don't know of any modern machine where that might be true. This does not fix all the problems with quantified back-references. In particular, the code is still broken for back-references that appear within a larger expression that is quantified (so that direct insertion of the quantification limits into the BACKREF node doesn't apply). I think fixing that will take some major surgery on the NFA code, specifically introducing an explicit iteration node type instead of trying to transform iteration into concatenation of modified regexps. Back-patch to all supported branches. In HEAD, also add a regression test case for this. (It may seem a bit silly to create a regression test file for just one test case; but I'm expecting that we will soon import a whole bunch of regex regression tests from Tcl, so might as well create the infrastructure now.)
* Sync regex code with Tcl 8.5.11.Tom Lane2012-02-17
| | | | | | | | | Sync our regex code with upstream changes since last time we did this, which was Tcl 8.5.0 (see commit df1e965e12cdd48c11057ee6e15346ee2b8b02f5). There are no functional changes here; the main point is just to lay down a commit-log marker that somebody has looked at this recently, and to do what we can to keep the two codebases comparable.
* Teach regular expression operators to honor collations.Tom Lane2011-04-10
| | | | | | | | | | | | | | This involves getting the character classification and case-folding functions in the regex library to use the collations infrastructure. Most of this work had been done already in connection with the upper/lower and LIKE logic, so it was a simple matter of transposition. While at it, split out these functions into a separate source file regc_pg_locale.c, so that they can be correctly labeled with the Postgres project's license rather than the Scriptics license. These functions are 100% Postgres-written code whereas what remains in regc_locale.c is still mostly not ours, so lumping them both under the same copyright notice was getting more and more misleading.
* Remove cvs keywords from all files.Magnus Hagander2010-09-20
|
* Tweak a couple of macros in the regex code to suppress compiler warningsTom Lane2010-08-02
| | | | | | | | from "clang". The VERR changes make an assignment unconditional, which is probably easier to read/understand anyway, and one can hardly argue that it's worth shaving cycles off the case of reporting another error when one has already been detected. The INSIST change limits where that macro can be used, but not in a way that creates a problem for any existing call.
* Change regexp engine's ccondissect/crevdissect routines to perform DFATom Lane2010-02-01
| | | | | | | | | | | | | | | | | matching before recursing instead of after. The DFA match eliminates unworkable midpoint choices a lot faster than the recursive check, in most cases, so doing it first can speed things up; particularly in pathological cases such as recently exhibited by Michael Glaesemann. In addition, apply some cosmetic changes that were applied upstream (in the Tcl project) at the same time, in order to sync with upstream version 1.15 of regexec.c. Upstream apparently intends to backpatch this, so I will too. The pathological behavior could be unpleasant if encountered in the field, which seems to justify any risk of introducing new bugs. Tom Lane, reviewed by Donal K. Fellows of Tcl project
* Standard pgindent run for 8.1.Bruce Momjian2005-10-15
|
* Clean up possibly-uninitialized-variable warnings reported by gcc 4.x.Tom Lane2005-09-24
|
* I made the patch that implements regexp_replace again.Bruce Momjian2005-07-10
| | | | | | | | | | | | | | | | | | | | | | | The specification of this function is as follows. regexp_replace(source text, pattern text, replacement text, [flags text]) returns text Replace string that matches to regular expression in source text to replacement text. - pattern is regular expression pattern. - replacement is replace string that can use '\1'-'\9', and '\&'. '\1'-'\9': back reference to the n'th subexpression. '\&' : entire matched string. - flags can use the following values: g: global (replace all) i: ignore case When the flags is not specified, case sensitive, replace the first instance only. Atsushi Ogawa
* $Header: -> $PostgreSQL Changes ...PostgreSQL Daemon2003-11-29
|
* Another pgindent run with updated typedefs.Bruce Momjian2003-08-08
|
* pgindent run.Bruce Momjian2003-08-04
|
* Replace regular expression package with Henry Spencer's latest versionTom Lane2003-02-05
| | | | | | | (extracted from Tcl 8.4.1 release, as Henry still hasn't got round to making it a separate library). This solves a performance problem for multibyte, as well as upgrading our regexp support to match recent Tcl and nearly match recent Perl.
* This patch removes a bunch of superfluous #include directives: ifBruce Momjian2002-11-08
| | | | | | | | postgres.h or c.h includes a system header (such as stdio.h or stdlib.h), there's no need to specifically include it in any of the .c files in the backend. Neil Conway
* Remove #ifdef MULTIBYTE per hackers list discussion.Tatsuo Ishii2002-08-29
|
* Implement SQL99 OVERLAY(). Allows substitution of a substring in a string.Thomas G. Lockhart2002-06-11
| | | | | | | | | | | Implement SQL99 SIMILAR TO as a synonym for our existing operator "~". Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape). Extend the definition to make the FOR clause optional. Define textregexsubstr() to actually implement this feature. Update the regression test to include these new string features. All tests pass. Rename the regular expression support routines from "pg95_xxx" to "pg_xxx". Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
* pgindent run on all C files. Java run to follow. initdb/regressionBruce Momjian2001-10-25
| | | | tests pass.
* Add do { ... } while (0) to more bad macros.Bruce Momjian2001-10-25
|
* pgindent run. Make it all clean.Bruce Momjian2001-03-22
|
* Clean up portability problems in regexp package: change all routineTom Lane2001-02-13
| | | | | | definitions from K&R to ANSI C style, and fix broken assumption that int and long are the same datatype. This repairs problems observed on Alpha with regexps having between 32 and 63 states.
* Removed MBFLAGS from makefiles since it's now done in include/config.h.Peter Eisentraut2000-01-19
|
* Change #include's to use <> and "" as appropriate.Bruce Momjian1999-07-15
|
* OK, folks, here is the pgindent output.Bruce Momjian1998-09-01
|
* Renaming cleanup, no pgindent yet.Bruce Momjian1998-09-01
|
* Add auto-size to screen to \d? commands. Use UNION to show allBruce Momjian1998-07-18
| | | | | \d? results in one query. Add \d? field search feature. Rename MB to MULTIBYTE.
* From: t-ishii@sra.co.jpMarc G. Fournier1998-03-15
| | | | | | | | | | | | | | | Included are patches intended for allowing PostgreSQL to handle multi-byte charachter sets such as EUC(Extende Unix Code), Unicode and Mule internal code. With the MB patch you can use multi-byte character sets in regexp and LIKE. The encoding system chosen is determined at the compile time. To enable the MB extension, you need to define a variable "MB" in Makefile.global or in Makefile.custom. For further information please take a look at README.mb under doc directory. (Note that unlike "jp patch" I do not use modified GNU regexp any more. I changed Henry Spencer's regexp coming with PostgreSQL.)
* Goodbye register keyword. Compiler knows better.Bruce Momjian1998-02-11
|
* Another PGINDENT run that changes variable indenting and case label ↵Bruce Momjian1997-09-08
| | | | indenting. Also static variable indenting.
* Massive commit to run PGINDENT on all *.c and *.h files.Bruce Momjian1997-09-07
|
* Compile and warning cleanupBruce Momjian1996-11-08
|
* Various patches from Bryan that *should* clean up the compile problemsMarc G. Fournier1996-09-20
| | | | ppl are seeing with v2.0
* Moved the include files to src/include/regexMarc G. Fournier1996-08-28
|
* Postgres95 1.01 Distribution - Virgin SourcesPG95-1_01Marc G. Fournier1996-07-09