path: root/src/backend/regex
Commit message (Author, Date)
* Fix regex back-references that are directly quantified with *. (Tom Lane, 2012-02-20)
  The syntax "\n*", that is a backref with a * quantifier directly applied to it, has never worked correctly in Spencer's library. This has been an open bug in the Tcl bug tracker since 2005: https://sourceforge.net/tracker/index.php?func=detail&aid=1115587&group_id=10894&atid=110894
  The core of the problem is in parseqatom(), which first changes "\n*" to "\n+|" and then applies repeat() to the NFA representing the backref atom. repeat() thinks that any arc leading into its "rp" argument is part of the sub-NFA to be repeated. Unfortunately, since parseqatom() already created the arc that was intended to represent the empty bypass around "\n+", this arc gets moved too, so that it now leads into the state loop created by repeat(). Thus, what was supposed to be an "empty" bypass gets turned into something that represents zero or more repetitions of the NFA representing the backref atom. In the original example, in place of
      ^([bc])\1*$
  we now have something that acts like
      ^([bc])(\1+|[bc]*)$
  At runtime, the branch involving the actual backref fails, as it's supposed to, but then the other branch succeeds anyway.
  We could no doubt fix this by some rearrangement of the operations in parseqatom(), but that code is plenty ugly already, and what's more the whole business of converting "x*" to "x+|" probably needs to go away to fix another problem I'll mention in a moment. Instead, this patch suppresses the *-conversion when the target is a simple backref atom, leaving the case of m == 0 to be handled at runtime. This makes the patch in regcomp.c a one-liner, at the cost of having to tweak cbrdissect() a little. In the event I went a bit further than that and rewrote cbrdissect() to check all the string-length-related conditions before it starts comparing characters. It seems a bit stupid to possibly iterate through many copies of an n-character backreference, only to fail at the end because the target string's length isn't a multiple of n --- we could have found that out before starting. The existing coding could only be a win if integer division is hugely expensive compared to character comparison, but I don't know of any modern machine where that might be true.
  This does not fix all the problems with quantified back-references. In particular, the code is still broken for back-references that appear within a larger expression that is quantified (so that direct insertion of the quantification limits into the BACKREF node doesn't apply). I think fixing that will take some major surgery on the NFA code, specifically introducing an explicit iteration node type instead of trying to transform iteration into concatenation of modified regexps.
  Back-patch to all supported branches. In HEAD, also add a regression test case for this. (It may seem a bit silly to create a regression test file for just one test case; but I'm expecting that we will soon import a whole bunch of regex regression tests from Tcl, so might as well create the infrastructure now.)
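The idea of checking every length-related condition before comparing characters can be sketched roughly as below. This is an illustration only, not the actual cbrdissect() code: the function name backref_repeat_matches, the REPEAT_INFINITE marker, and the argument layout are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define REPEAT_INFINITE (-1)    /* hypothetical marker for an unbounded max */

/*
 * Hypothetical helper (not PostgreSQL's cbrdissect): does target[0..len)
 * consist of between min and max copies of the backref text ref[0..n)?
 */
static bool
backref_repeat_matches(const char *ref, size_t n,
                       const char *target, size_t len,
                       int min, int max)
{
    size_t      reps;
    size_t      i;

    /* An empty backref can only ever match empty text. */
    if (n == 0)
        return len == 0;

    /* Length checks first: no character has been compared yet. */
    if (len % n != 0)
        return false;
    reps = len / n;
    if (reps < (size_t) min)
        return false;
    if (max != REPEAT_INFINITE && reps > (size_t) max)
        return false;

    /* Only now compare each copy against the backref text. */
    for (i = 0; i < reps; i++)
    {
        if (memcmp(target + i * n, ref, n) != 0)
            return false;
    }
    return true;
}
```

With min == 0, an empty target is accepted without any comparison, which corresponds to handling the zero-repetition case at runtime rather than through a separate empty NFA branch.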
* Pgindent run before 9.1 beta2. (Bruce Momjian, 2011-06-09)
* Insert dummy "break"s to silence compiler complaints. (Tom Lane, 2011-04-10)
  Apparently some compilers dislike a case label with nothing after it. Per buildfarm.
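A minimal sketch of the pattern, using a made-up enum and function: older C standards require a label to be followed by a statement, so a final case with an empty body gets a "break" of its own.

```c
/* Hypothetical token kinds, for illustration only. */
enum token_kind
{
    TOK_PLAIN,
    TOK_IGNORED
};

static int
token_weight(enum token_kind kind)
{
    int         weight = 0;

    switch (kind)
    {
        case TOK_PLAIN:
            weight = 1;
            break;
        case TOK_IGNORED:
            /* nothing to do; the dummy "break" gives the label a statement */
            break;
    }
    return weight;
}
```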
* Teach regular expression operators to honor collations. (Tom Lane, 2011-04-10)
  This involves getting the character classification and case-folding functions in the regex library to use the collations infrastructure. Most of this work had been done already in connection with the upper/lower and LIKE logic, so it was a simple matter of transposition.
  While at it, split out these functions into a separate source file regc_pg_locale.c, so that they can be correctly labeled with the Postgres project's license rather than the Scriptics license. These functions are 100% Postgres-written code whereas what remains in regc_locale.c is still mostly not ours, so lumping them both under the same copyright notice was getting more and more misleading.
* pgindent run before PG 9.1 beta 1. (Bruce Momjian, 2011-04-10)
* Fix comparisons of pointers with zero to compare with NULL instead. (Tom Lane, 2010-10-29)
  Per C standard, these are semantically the same thing; but saying NULL when you mean NULL is good for readability.
  Marti Raudsepp, per results of INRIA's Coccinelle.
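As a small illustration of the style point (the list type and function here are invented for the example, not taken from the regex code):

```c
#include <stddef.h>

/* Hypothetical singly linked list node. */
struct node
{
    struct node *next;
};

static size_t
list_length(const struct node *head)
{
    size_t      count = 0;

    /* "head != 0" means the same thing, but NULL states the intent. */
    while (head != NULL)
    {
        count++;
        head = head->next;
    }
    return count;
}
```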
* Remove cvs keywords from all files. (Magnus Hagander, 2010-09-20)
* Tweak a couple of macros in the regex code to suppress compiler warnings from "clang". (Tom Lane, 2010-08-02)
  The VERR changes make an assignment unconditional, which is probably easier to read/understand anyway, and one can hardly argue that it's worth shaving cycles off the case of reporting another error when one has already been detected. The INSIST change limits where that macro can be used, but not in a way that creates a problem for any existing call.
* pgindent run for 9.0. (Bruce Momjian, 2010-02-26)
* Change regexp engine's ccondissect/crevdissect routines to perform DFA matching before recursing instead of after. (Tom Lane, 2010-02-01)
  The DFA match eliminates unworkable midpoint choices a lot faster than the recursive check, in most cases, so doing it first can speed things up; particularly in pathological cases such as recently exhibited by Michael Glaesemann.
  In addition, apply some cosmetic changes that were applied upstream (in the Tcl project) at the same time, in order to sync with upstream version 1.15 of regexec.c.
  Upstream apparently intends to backpatch this, so I will too. The pathological behavior could be unpleasant if encountered in the field, which seems to justify any risk of introducing new bugs.
  Tom Lane, reviewed by Donal K. Fellows of Tcl project
* Fix some comments that got mangled by pgindent. (Tom Lane, 2010-01-30)
* Teach the regular expression functions to do case-insensitive matching and locale-dependent character classification properly when the database encoding is UTF8. (Tom Lane, 2009-12-01)
  The previous coding worked okay in single-byte encodings, or in any case for ASCII characters, but failed entirely on multibyte characters.
  The fix assumes that the <wctype.h> functions use Unicode code points as the wchar representation for Unicode, ie, wchar matches pg_wchar.
  This is only a partial solution, since we're still stupid about non-ASCII characters in multibyte encodings other than UTF8. The practical effect of that is limited, however, since those cases are generally Far Eastern glyphs for which concepts like case-folding don't apply anyway. Certainly all or nearly all of the field reports of problems have been about UTF8. A more general solution would require switching to the platform's wchar representation for all regex operations; which is possible but would have substantial disadvantages. Let's try this and see if it's sufficient in practice.
* 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef list provided by Andrew. (Bruce Momjian, 2009-06-11)
* Refactor backend makefiles to remove lots of duplicate code. (Peter Eisentraut, 2008-02-19)
* Sync our regex code with upstream changes since last time we did this, which was Tcl 8.4.8. (Tom Lane, 2008-02-14)
  The main changes are to remove the never-fully-implemented code for multi-character collating elements, and to const-ify some stuff a bit more fully. In combination with the recent security patch, this commit brings us into line with Tcl 8.5.0.
  Note that I didn't make any effort to duplicate a lot of cosmetic changes that they made to bring their copy into line with their own style guidelines, such as adding braces around single-line IF bodies. Most of those we either had done already (such as ANSI-fication of function headers) or there is no point because pgindent would undo the change anyway.
* Fix assorted security-grade bugs in the regex engine. (Tom Lane, 2008-01-03)
  All of these problems are shared with Tcl, since it's their code to begin with, and the patches have been copied from Tcl 8.5.0. Problems:
      CVE-2007-4769: Inadequate check on the range of backref numbers allows crash due to out-of-bounds read.
      CVE-2007-4772: Infinite loop in regex optimizer for pattern '($|^)*'.
      CVE-2007-6067: Very slow optimizer cleanup for regex with a large NFA representation, as well as crash if we encounter an out-of-memory condition during NFA construction.
  Part of the response to CVE-2007-6067 is to put a limit on the number of states in the NFA representation of a regex. This seems needed even though the within-the-code problems have been corrected, since otherwise the code could try to use very large amounts of memory for a suitably-crafted regex, leading to potential DOS by driving the system into swap, activating a kernel OOM killer, etc.
  Although there are certainly plenty of ways to drive the system into effective DOS with poorly-written SQL queries, these problems seem worth treating as security issues because many applications might accept regex search patterns from untrustworthy sources.
  Thanks to Will Drewry of Google for reporting these problems. Patches by Will Drewry and Tom Lane.
  Security: CVE-2007-4769, CVE-2007-4772, CVE-2007-6067
* pgindent run for 8.3. (Bruce Momjian, 2007-11-15)
* Add a useless return statement to suppress a warning seen with some versions of gcc (I'm seeing it with Apple's gcc 4.0.1). (Tom Lane, 2007-10-22)
  I think the reason we did not see this before was that the assert() macros in the regex code were all no-ops till recently.
* Make dumpcolors() have tolerable performance when using 32-bit chr, as we do (and upstream Tcl doesn't). (Tom Lane, 2007-10-06)
  The loop limit might be subject to negotiation if anyone ever tries to do regex debugging in Far Eastern languages, but for now 1000 seems plenty. CHR_MAX was right out :-(
* Adjust some regex debugging printouts to not give wrong-format-width warnings on a 64-bit machine. (Tom Lane, 2007-10-06)
  Noted while chasing a recent regex bug report.
* Wording cleanup for error messages. Also change can't -> cannot. (Bruce Momjian, 2007-02-01)
  Standard English uses "may", "can", and "might" in different ways:
      may - permission, "You may borrow my rake."
      can - ability, "I can lift that log."
      might - possibility, "It might rain today."
  Unfortunately, in conversational English, their use is often mixed, as in, "You may use this variable to do X", when in fact, "can" is a better choice. Similarly, "It may crash" is better stated, "It might crash".
* Re-run pgindent, fixing a problem where comment lines after a blank comment line were output as too long, and update typedefs for /lib directory. (Bruce Momjian, 2005-11-22)
  Also fix case where identifiers were used as variable names in the backend, but as typedefs in ecpg (favor the backend for indenting).
  Backpatch to 8.1.X.
* Standard pgindent run for 8.1. (Bruce Momjian, 2005-10-15)
* Clean up possibly-uninitialized-variable warnings reported by gcc 4.x. (Tom Lane, 2005-09-24)
* I made the patch that implements regexp_replace again. (Bruce Momjian, 2005-07-10)
  The specification of this function is as follows.
      regexp_replace(source text, pattern text, replacement text, [flags text]) returns text
  Replaces the part of the source text that matches the regular expression with the replacement text.
  - pattern is a regular expression pattern.
  - replacement is the replacement string, which can use '\1'-'\9' and '\&'.
      '\1'-'\9': back reference to the n'th subexpression.
      '\&': entire matched string.
  - flags can use the following values:
      g: global (replace all)
      i: ignore case
    When flags is not specified, matching is case sensitive and only the first instance is replaced.
  Atsushi Ogawa
* Add parentheses to macros when args are used in computations. (Bruce Momjian, 2005-05-25)
  Without them, the execution behavior could be unexpected.
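A short illustration of why the parentheses matter, using two made-up macros rather than any macro from the regex code:

```c
/* Hypothetical macros, for illustration only. */

/* DOUBLE_UNSAFE(1 + 2) expands to (1 + 2 * 2), so precedence changes the result. */
#define DOUBLE_UNSAFE(x)    (x * 2)

/* DOUBLE_SAFE(1 + 2) expands to ((1 + 2) * 2), which is what the caller meant. */
#define DOUBLE_SAFE(x)      ((x) * 2)
```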
* Install Tcl regex fixes to sync our regex engine with Tcl 8.4.8 (up from 8.4.1). (Tom Lane, 2004-11-24)
  This corrects some curious regex bugs, though not the greediness issue I was hoping to find a solution for :-(
* Solve the 'Turkish problem' with undesirable locale behavior for case conversion of basic ASCII letters. (Tom Lane, 2004-05-07)
  Remove all uses of strcasecmp and strncasecmp in favor of new functions pg_strcasecmp and pg_strncasecmp; remove most but not all direct uses of toupper and tolower in favor of pg_toupper and pg_tolower. These functions use the same notions of case folding already developed for identifier case conversion. I left the straight locale-based folding in place for situations where we are just manipulating user data and not trying to match it to built-in strings --- for example, the SQL upper() function is still locale dependent. Perhaps this will prove not to be what's wanted, but at the moment we can initdb and pass regression tests in Turkish locale.
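The general idea of locale-independent folding for ASCII letters can be sketched as follows. The names ascii_tolower and ascii_strcasecmp are hypothetical stand-ins, not the pg_tolower and pg_strcasecmp implementations, and the sketch deliberately leaves every non-ASCII byte untouched, which the real functions may handle differently.

```c
/*
 * Hypothetical stand-in for locale-independent folding: only A-Z are
 * changed, so 'I' always folds to 'i' whatever the current locale says.
 */
static unsigned char
ascii_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}

/* Case-insensitive comparison built on the ASCII-only folding above. */
static int
ascii_strcasecmp(const char *s1, const char *s2)
{
    for (;;)
    {
        unsigned char c1 = ascii_tolower((unsigned char) *s1++);
        unsigned char c2 = ascii_tolower((unsigned char) *s2++);

        if (c1 != c2)
            return (c1 < c2) ? -1 : 1;
        if (c1 == '\0')
            return 0;
    }
}
```

Because the folding never consults the locale, comparing a user-typed keyword against a built-in string gives the same answer under a Turkish locale as under any other.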
* $Header: -> $PostgreSQL Changes ... (PostgreSQL Daemon, 2003-11-29)
* Fix broken definition of :print: character class, per Bruno Wolff. (Tom Lane, 2003-09-29)
  Also, make :alnum: character class directly dependent on isalnum() rather than guessing.
* Another pgindent run with updated typedefs. (Bruce Momjian, 2003-08-08)
* pgindent run. (Bruce Momjian, 2003-08-04)
* Replace regular expression package with Henry Spencer's latest version (extracted from Tcl 8.4.1 release, as Henry still hasn't got round to making it a separate library). (Tom Lane, 2003-02-05)
  This solves a performance problem for multibyte, as well as upgrading our regexp support to match recent Tcl and nearly match recent Perl.
* This patch removes a bunch of superfluous #include directives: if postgres.h or c.h includes a system header (such as stdio.h or stdlib.h), there's no need to specifically include it in any of the .c files in the backend. (Bruce Momjian, 2002-11-08)
  Neil Conway
* Remove retest Makefile entry because it does not compile. (Bruce Momjian, 2002-09-16)
* pgindent run. (Bruce Momjian, 2002-09-04)
* Remove all traces of multibyte and locale options. (Peter Eisentraut, 2002-09-03)
  Clean up comments referring to "multibyte" where it really means character encoding.
* Remove sys/types.h in files that include postgres.h, and hence c.h, because c.h has sys/types.h. (Bruce Momjian, 2002-09-02)
* Remove #ifdef MULTIBYTE per hackers list discussion. (Tatsuo Ishii, 2002-08-29)
* Implement SQL99 OVERLAY(). Allows substitution of a substring in a string. (Thomas G. Lockhart, 2002-06-11)
  Implement SQL99 SIMILAR TO as a synonym for our existing operator "~".
  Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape). Extend the definition to make the FOR clause optional. Define textregexsubstr() to actually implement this feature.
  Update the regression test to include these new string features. All tests pass.
  Rename the regular expression support routines from "pg95_xxx" to "pg_xxx".
  Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
* Fix code to work when isalpha and friends are macros, not functions. (Tom Lane, 2002-05-05)
* [ Patch comments in three pieces. ] (Bruce Momjian, 2002-04-24)
  Attached is a patch against 7.2 which adds locale awareness to the character classes of the regular expression engine.
  ...
  > > I still think the xdigit class could be handled the same way the digit
  > > class is (by enumeration rather than using the isxdigit function). That
  > > saves you a cycle, and I don't think there's any loss.
  > > In fact, I will email you when I apply the original patch.
  I miss that case :-(. Here is the patched patch.
  ...
  Here is a patch which addresses Tatsuo's concerns (it does return a static struct instead of constructing it).
* New pgindent run with fixes suggested by Tom. (Bruce Momjian, 2001-11-05)
  Patch manually reviewed, initdb/regression tests pass.
* Another pgindent run. (Bruce Momjian, 2001-10-28)
  Fixes enum indenting, and improves #endif spacing. Also adds space for one-line comments.
* pgindent run on all C files. (Bruce Momjian, 2001-10-25)
  Java run to follow. initdb/regression tests pass.
* Add do { ... } while (0) to more bad macros. (Bruce Momjian, 2001-10-25)
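For context, a sketch of what makes a multi-statement macro "bad" and how the do { ... } while (0) wrapper repairs it. The macros and functions are invented for the example and are not taken from the PostgreSQL sources.

```c
/* Hypothetical macros, for illustration only. */

/* Two statements: after "if (cond)", only the first one is conditional. */
#define RESET_BAD(a, b)     (a) = 0; (b) = 0

/* One statement: the whole body is governed by the enclosing "if". */
#define RESET_GOOD(a, b)    do { (a) = 0; (b) = 0; } while (0)

static int
sum_after_bad(int cond)
{
    int         x = 1;
    int         y = 1;

    if (cond)
        RESET_BAD(x, y);        /* y is reset even when cond is false */
    return x + y;
}

static int
sum_after_good(int cond)
{
    int         x = 1;
    int         y = 1;

    if (cond)
        RESET_GOOD(x, y);       /* both resets happen only when cond is true */
    return x + y;
}
```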
* Add dependency for regexec.c (Tatsuo Ishii, 2001-10-04)
* pgindent run. Make it all clean. (Bruce Momjian, 2001-03-22)
* Make regular-expression error messages a tad less obscure, per gripe from Josh Berkus. (Tom Lane, 2001-03-19)
* Clean up portability problems in regexp package: change all routine definitions from K&R to ANSI C style, and fix broken assumption that int and long are the same datatype. (Tom Lane, 2001-02-13)
  This repairs problems observed on Alpha with regexps having between 32 and 63 states.
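The commit message does not say which expressions were at fault. One common form of the int-versus-long assumption, shown here purely as an illustration with hypothetical names, is building a per-state bit mask by shifting a plain int constant: that works for the first 32 states and breaks beyond them on a platform where the mask type is wider than int. Widening the constant before the shift avoids the problem.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state set: one bit per NFA state, up to 64 states. */
typedef uint64_t state_set;

static state_set
state_set_add(state_set set, unsigned state)
{
    /* Widen the constant first; "1 << state" would shift a plain int. */
    return set | ((state_set) 1 << state);
}

static bool
state_set_contains(state_set set, unsigned state)
{
    return (set & ((state_set) 1 << state)) != 0;
}
```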