aboutsummaryrefslogtreecommitdiff
path: root/src/include/regex/regex.h
Commit message (Collapse)AuthorAge
* Implement lookbehind constraints in our regular-expression engine.Tom Lane2015-10-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
* Improve memory-usage accounting in regular-expression compiler.Tom Lane2015-10-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This code previously counted the number of NFA states it created, and complained if a limit was exceeded, so as to prevent bizarre regex patterns from consuming unreasonable time or memory. That's fine as far as it went, but the code paid no attention to how many arcs linked those states. Since regexes can be contrived that have O(N) states but will need O(N^2) arcs after fixempties() processing, it was still possible to blow out memory, and take a long time doing it too. To fix, modify the bookkeeping to count space used by both states and arcs. I did not bother with including the "color map" in the accounting; it can only grow to a few megabytes, which is not a lot in comparison to what we're allowing for states+arcs (about 150MB on 64-bit machines or half that on 32-bit machines). Looking at some of the larger real-world regexes captured in the Tcl regression test suite suggests that the most that is likely to be needed for regexes found in the wild is under 10MB, so I believe that the current limit has enough headroom to make it okay to keep it as a hard-wired limit. In connection with this, redefine REG_ETOOBIG as meaning "regular expression is too complex"; the previous wording of "nfa has too many states" was already somewhat inapropos because of the error code's use for stack depth overrun, and it was not very user-friendly either. Back-patch to all supported branches.
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Allow regex operations to be terminated early by query cancel requests.Tom Lane2014-03-01
| | | | | | | | | | | | | | | | | | | | | | | | | The regex code didn't have any provision for query cancel; which is unsurprising given its non-Postgres origin, but still problematic since some operations can take a long time. Introduce a callback function to check for a pending query cancel or session termination request, and call it in a couple of strategic spots where we can make the regex code exit with an error indicator. If we ever actually split out the regex code as a standalone library, some additional work will be needed to let the cancel callback function be specified externally to the library. But that's straightforward (certainly so by comparison to putting the locale-dependent character classification logic on a similar arms-length basis), and there seems no need to do it right now. A bigger issue is that there may be more places than these two where we need to check for cancels. We can always add more checks later, now that the infrastructure is in place. Since there are known examples of not-terribly-long regexes that can lock up a backend for a long time, back-patch to all supported branches. I have hopes of fixing the known performance problems later, but adding query cancel ability seems like a good idea even if they were all fixed.
* Fix crash on compiling a regular expression with more than 32k colors.Heikki Linnakangas2013-04-04
| | | | | | Throw an error instead. Backpatch to all supported branches.
* Re-implement extraction of fixed prefixes from regular expressions.Tom Lane2012-07-10
| | | | | | | | | | | | | | | | | | | | | | | To generate btree-indexable conditions from regex WHERE conditions (such as WHERE indexed_col ~ '^foo'), we need to be able to identify any fixed prefix that a regex might have; that is, find any string that must be a prefix of all strings satisfying the regex. We used to do that with entirely ad-hoc code that looked at the source text of the regex. It didn't know very much about regex syntax, which mostly meant that it would fail to identify some optimizable cases; but Viktor Rosenfeld reported that it would produce actively wrong answers for quantified parenthesized subexpressions, such as '^(foo)?bar'. Rather than trying to extend the ad-hoc code to cover this, let's get rid of it altogether in favor of identifying prefixes by examining the compiled form of a regex. To do this, I've added a new entry point "pg_regprefix" to the regex library; hopefully it is defined in a sufficiently general fashion that it can remain in the library when/if that code gets split out as a standalone project. Since this bug has been there for a very long time, this fix needs to get back-patched. However it depends on some other recent commits (particularly the addition of wchar-to-database-encoding conversion), so I'll commit this separately and then go to work on back-porting the necessary fixes.
* Teach regular expression operators to honor collations.Tom Lane2011-04-10
| | | | | | | | | | | | | | This involves getting the character classification and case-folding functions in the regex library to use the collations infrastructure. Most of this work had been done already in connection with the upper/lower and LIKE logic, so it was a simple matter of transposition. While at it, split out these functions into a separate source file regc_pg_locale.c, so that they can be correctly labeled with the Postgres project's license rather than the Scriptics license. These functions are 100% Postgres-written code whereas what remains in regc_locale.c is still mostly not ours, so lumping them both under the same copyright notice was getting more and more misleading.
* Remove cvs keywords from all files.Magnus Hagander2010-09-20
|
* Remove regex_flavor GUC, so that regular expressions are always "advanced"Tom Lane2009-10-21
| | | | | | | | | style by default. Per discussion, there seems to be hardly anything that really relies on being able to change the regex flavor, so the ability to select it via embedded options ought to be enough for any stragglers. Also, if we didn't remove the GUC, we'd really be morally obligated to mark the regex functions non-immutable, which'd possibly create performance issues.
* 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef listBruce Momjian2009-06-11
| | | | provided by Andrew.
* Convert three more guc settings to enum type:Magnus Hagander2008-04-02
| | | | default_transaction_isolation, session_replication_role and regex_flavor.
* Fix assorted security-grade bugs in the regex engine. All of these problemsTom Lane2008-01-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | are shared with Tcl, since it's their code to begin with, and the patches have been copied from Tcl 8.5.0. Problems: CVE-2007-4769: Inadequate check on the range of backref numbers allows crash due to out-of-bounds read. CVE-2007-4772: Infinite loop in regex optimizer for pattern '($|^)*'. CVE-2007-6067: Very slow optimizer cleanup for regex with a large NFA representation, as well as crash if we encounter an out-of-memory condition during NFA construction. Part of the response to CVE-2007-6067 is to put a limit on the number of states in the NFA representation of a regex. This seems needed even though the within-the-code problems have been corrected, since otherwise the code could try to use very large amounts of memory for a suitably-crafted regex, leading to potential DOS by driving the system into swap, activating a kernel OOM killer, etc. Although there are certainly plenty of ways to drive the system into effective DOS with poorly-written SQL queries, these problems seem worth treating as security issues because many applications might accept regex search patterns from untrustworthy sources. Thanks to Will Drewry of Google for reporting these problems. Patches by Will Drewry and Tom Lane. Security: CVE-2007-4769, CVE-2007-4772, CVE-2007-6067
* Standard pgindent run for 8.1.Bruce Momjian2005-10-15
|
* I made the patch that implements regexp_replace again.Bruce Momjian2005-07-10
| | | | | | | | | | | | | | | | | | | | | | | The specification of this function is as follows. regexp_replace(source text, pattern text, replacement text, [flags text]) returns text Replace string that matches to regular expression in source text to replacement text. - pattern is regular expression pattern. - replacement is replace string that can use '\1'-'\9', and '\&'. '\1'-'\9': back reference to the n'th subexpression. '\&' : entire matched string. - flags can use the following values: g: global (replace all) i: ignore case When the flags is not specified, case sensitive, replace the first instance only. Atsushi Ogawa
* make sure the $Id tags are converted to $PostgreSQL as well ...PostgreSQL Daemon2003-11-29
|
* Another pgindent run with updated typedefs.Bruce Momjian2003-08-08
|
* pgindent run.Bruce Momjian2003-08-04
|
* Replace regular expression package with Henry Spencer's latest versionTom Lane2003-02-05
| | | | | | | (extracted from Tcl 8.4.1 release, as Henry still hasn't got round to making it a separate library). This solves a performance problem for multibyte, as well as upgrading our regexp support to match recent Tcl and nearly match recent Perl.
* pgindent run.Bruce Momjian2002-09-04
|
* Remove #ifdef MULTIBYTE per hackers list discussion.Tatsuo Ishii2002-08-29
|
* Implement SQL99 OVERLAY(). Allows substitution of a substring in a string.Thomas G. Lockhart2002-06-11
| | | | | | | | | | | Implement SQL99 SIMILAR TO as a synonym for our existing operator "~". Implement SQL99 regular expression SUBSTRING(string FROM pat FOR escape). Extend the definition to make the FOR clause optional. Define textregexsubstr() to actually implement this feature. Update the regression test to include these new string features. All tests pass. Rename the regular expression support routines from "pg95_xxx" to "pg_xxx". Define CREATE CHARACTER SET in the parser per SQL99. No implementation yet.
* New pgindent run with fixes suggested by Tom. Patch manually reviewed,Bruce Momjian2001-11-05
| | | | initdb/regression tests pass.
* Another pgindent run. Fixes enum indenting, and improves #endifBruce Momjian2001-10-28
| | | | spacing. Also adds space for one-line comments.
* pgindent run on all C files. Java run to follow. initdb/regressionBruce Momjian2001-10-25
| | | | tests pass.
* pgindent run. Make it all clean.Bruce Momjian2001-03-22
|
* Clean up portability problems in regexp package: change all routineTom Lane2001-02-13
| | | | | | definitions from K&R to ANSI C style, and fix broken assumption that int and long are the same datatype. This repairs problems observed on Alpha with regexps having between 32 and 63 states.
* Restructure the key include files per recent pghackers discussion: thereTom Lane2001-02-10
| | | | | | | | | | | are now separate files "postgres.h" and "postgres_fe.h", which are meant to be the primary include files for backend .c files and frontend .c files respectively. By default, only include files meant for frontend use are installed into the installation include directory. There is a new make target 'make install-all-headers' that adds the whole content of the src/include tree to the installed fileset, for use by people who want to develop server-side code without keeping the complete source tree on hand. Cleaned up a whole lot of crufty and inconsistent header inclusions.
* Removed MBFLAGS from makefiles since it's now done in include/config.h.Peter Eisentraut2000-01-19
|
* Change #include's to use <> and "" as appropriate.Bruce Momjian1999-07-15
|
* OK, folks, here is the pgindent output.Bruce Momjian1998-09-01
|
* I really hope that I haven't missed anything in this one...Marc G. Fournier1998-07-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | From: t-ishii@sra.co.jp Attached are patches to enhance the multi-byte support. (patches are against 7/18 snapshot) * determine encoding at initdb/createdb rather than compile time Now initdb/createdb has an option to specify the encoding. Also, I modified the syntax of CREATE DATABASE to accept encoding option. See README.mb for more details. For this purpose I have added new column "encoding" to pg_database. Also pg_attribute and pg_class are changed to catch up the modification to pg_database. Actually I haved added pg_database_mb.h, pg_attribute_mb.h and pg_class_mb.h. These are used only when MB is enabled. The reason having separate files is I couldn't find a way to use ifdef or whatever in those files. I have to admit it looks ugly. No way. * support for PGCLIENTENCODING when issuing COPY command commands/copy.c modified. * support for SQL92 syntax "SET NAMES" See gram.y. * support for LATIN2-5 * add UNICODE regression test case * new test suite for MB New directory test/mb added. * clean up source files Basic idea is to have MB's own subdirectory for easier maintenance. These are include/mb and backend/utils/mb.
* Add auto-size to screen to \d? commands. Use UNION to show allBruce Momjian1998-07-18
| | | | | \d? results in one query. Add \d? field search feature. Rename MB to MULTIBYTE.
* From: t-ishii@sra.co.jpMarc G. Fournier1998-03-15
| | | | | | | | | | | | | | | Included are patches intended for allowing PostgreSQL to handle multi-byte charachter sets such as EUC(Extende Unix Code), Unicode and Mule internal code. With the MB patch you can use multi-byte character sets in regexp and LIKE. The encoding system chosen is determined at the compile time. To enable the MB extension, you need to define a variable "MB" in Makefile.global or in Makefile.custom. For further information please take a look at README.mb under doc directory. (Note that unlike "jp patch" I do not use modified GNU regexp any more. I changed Henry Spencer's regexp coming with PostgreSQL.)
* Used modified version of indent that understands over 100 typedefs.Bruce Momjian1997-09-08
|
* Another PGINDENT run that changes variable indenting and case label ↵Bruce Momjian1997-09-08
| | | | indenting. Also static variable indenting.
* Massive commit to run PGINDENT on all *.c and *.h files.Bruce Momjian1997-09-07
|
* Make compile on AIX, Alpha OSF. Thanks Darren King, Igor Notanzon.Bryan Henderson1996-12-15
|
* ...same...Marc G. Fournier1996-11-06
|
* Added needed include file.Bruce Momjian1996-10-31
|
* Add the regex include files to the repository...Marc G. Fournier1996-09-20
In my cvs source tree, tihs directory existed, which is why it compiled on my system, but nobody elses...