aboutsummaryrefslogtreecommitdiff
path: root/src/include/regex
Commit message (Collapse)AuthorAge
* Update copyright for 2025Bruce Momjian2025-01-01
| | | | Backpatch-through: 13
* Avoid assertion due to disconnected NFA sub-graphs in regex parsing.Tom Lane2024-11-15
| | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 08c0d6ad6 which introduced "rainbow" arcs in regex NFAs, I didn't think terribly hard about what to do when creating the color complement of a rainbow arc. Clearly, the complement cannot match any characters, and I took the easy way out by just not building any arcs at all in the complement arc set. That mostly works, but Nikolay Shaplov found a case where it doesn't: if we decide to delete that sub-NFA later because it's inside a "{0}" quantifier, delsub() suffered an assertion failure. That's because delsub() relies on the target sub-NFA being fully connected. That was always true before, and the best fix seems to be to restore that property. Hence, invent a new arc type CANTMATCH that can be generated in place of an empty color complement, and drop it again later when we start NFA optimization. (At that point we don't need to do delsub() any more, and besides there are other cases where NFA optimization can lead to disconnected subgraphs.) It appears that this bug has no consequences in a non-assert-enabled build: there will be some transiently leaked NFA states/arcs, but they'll get cleaned up eventually. Still, we don't like assertion failures, so back-patch to v14 where rainbow arcs were introduced. Per bug #18708 from Nikolay Shaplov. Discussion: https://postgr.es/m/18708-f94f2599c9d2c005@postgresql.org
* Cope with <regex.h> name clashes.Thomas Munro2024-07-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | macOS 15's SDK pulls in headers related to <regex.h> when we include <xlocale.h>. This causes our own regex_t implementation to clash with the OS's regex_t implementation. Luckily our function names already had pg_ prefixes, but the macros and typenames did not. Include <regex.h> explicitly on all POSIX systems, and fix everything that breaks. Then we can prove that we are capable of fully hiding and replacing the system regex API with our own. 1. Deal with standard-clobbering macros by undefining them all first. POSIX says they are "symbolic constants". If they are macros, this allows us to redefine them. If they are enums or variables, our macros will hide them. 2. Deal with standard-clobbering types by giving our types pg_ prefixes, and then using macros to redirect xxx_t -> pg_xxx_t. After including our "regex/regex.h", the system <regex.h> is hidden, because we've replaced all the standard names. The PostgreSQL source tree and extensions can continue to use standard prefix-less type and macro names, but reach our implementation, if they included our "regex/regex.h" header. Back-patch to all supported branches, so that macOS 15's tool chain can build them. Reported-by: Stan Hu <stanhu@gmail.com> Suggested-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://postgr.es/m/CAMBWrQnEwEJtgOv7EUNsXmFw2Ub4p5P%2B5QTBEgYwiyjy7rAsEQ%40mail.gmail.com
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* Fix various typosDavid Rowley2023-04-18
| | | | | | | | | | | | This fixes many spelling mistakes in comments, but a few references to invalid parameter names, function names and option names too in comments and also some in string constants Also, fix an #undef that was undefining the incorrect definition Author: Alexander Lakhin Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/d5f68d19-c0fc-91a9-118d-7c6a5a3f5fad@gmail.com
* Redesign interrupt/cancel API for regex engine.Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, a PostgreSQL-specific callback checked by the regex engine had a way to trigger a special error code REG_CANCEL if it detected that the next call to CHECK_FOR_INTERRUPTS() would certainly throw via ereport(). A later proposed bugfix aims to move some complex logic out of signal handlers, so that it won't run until the next CHECK_FOR_INTERRUPTS(), which makes the above design impossible unless we split CHECK_FOR_INTERRUPTS() into two phases, one to run logic and another to ereport(). We may develop such a system in the future, but for the regex code it is no longer necessary. An earlier commit moved regex memory management over to our MemoryContext system. Given that the purpose of the two-phase interrupt checking was to free memory before throwing, something we don't need to worry about anymore, it seems simpler to inject CHECK_FOR_INTERRUPTS() directly into cancelation points, and just let it throw. Since the plan is to keep PostgreSQL-specific concerns separate from the main regex engine code (with a view to bein able to stay in sync with other projects), do this with a new macro INTERRUPT(), customizable in regcustom.h and defaulting to nothing. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
* Use MemoryContext API for regex memory management.Thomas Munro2023-04-08
| | | | | | | | | | | | | | | | | | | | | | | | | | Previously, regex_t objects' memory was managed with malloc() and free() directly. Switch to palloc()-based memory management instead. Advantages: * memory used by cached regexes is now visible with MemoryContext observability tools * cleanup can be done automatically in certain failure modes (something that later commits will take advantage of) * cleanup can be done in bulk On the downside, there may be more fragmentation (wasted memory) due to per-regex MemoryContext objects. This is a problem shared with other cached objects in PostgreSQL and can probably be improved with later tuning. Thanks to Noah Misch for suggesting this general approach, which unblocks later work on interrupts. Suggested-by: Noah Misch <noah@leadboat.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKGK3PGKwcKqzoosamn36YW-fsuTdOPPF1i_rtEO%3DnEYKSg%40mail.gmail.com
* Update copyright for 2023Bruce Momjian2023-01-02
| | | | Backpatch-through: 11
* Consistently use named parameters in regex code.Peter Geoghegan2022-09-19
| | | | | | | | | | | | Make regex code consistently use named parameters in function declarations. Also make sure that parameter names from each function's declaration match corresponding definition parameter names. This makes Henry Spencer's regex code follow Postgres coding standards. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CAH2-WznJt9CMM9KJTMjJh_zbL5hD9oX44qdJ4aqZtjFi-zA3Tg@mail.gmail.com
* Remove configure probe for wctype.h.Thomas Munro2022-07-23
| | | | | | | | | | | This header is present in SUSv2 and Windows. Also remove the inclusion of <wchar.h>, following clues that it was only included for the benefit of historical systems that didn't have <wctype.h>. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKGKAmTgbg_hMiGG5T7pkpzOnY1cWFAHYtZXHCpqeC_hCkA%40mail.gmail.com
* Update copyright for 2022Bruce Momjian2022-01-07
| | | | Backpatch-through: 10
* Avoid determining regexp subexpression matches, when possible.Tom Lane2021-08-09
| | | | | | | | | | | | | | | | | | | | | Identifying the precise match locations for parenthesized subexpressions is a fairly expensive task given the way our regexp engine works, both at regexp compile time (where we must create an optimized NFA for each parenthesized subexpression) and at runtime (where determining exact match locations requires laborious search). Up to now we've made little attempt to optimize this situation. This patch identifies cases where we know at compile time that we won't need to know subexpression match locations, and teaches the regexp compiler to not bother creating per-subexpression regexps for parenthesis pairs that are not referenced by backrefs elsewhere in the regexp. (To preserve semantics, we obviously still have to pin down the match locations of backref references.) Users could have obtained the same results before this by being careful to write "non capturing" parentheses wherever possible, but few people bother with that. Discussion: https://postgr.es/m/2219936.1628115334@sss.pgh.pa.us
* Improve memory management in regex compiler.Tom Lane2021-02-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous logic here created a separate pool of arcs for each state, so that the out-arcs of each state were physically stored within it. Perhaps this choice was driven by trying to not include a "from" pointer within each arc; but Spencer gave up on that idea long ago, and it's hard to see what the value is now. The approach turns out to be fairly disastrous in terms of memory consumption, though. In the first place, NFAs built by this engine seem to have about 4 arcs per state on average, with a majority having only one or two out-arcs. So pre-allocating 10 out-arcs for each state is already cause for a factor of two or more bloat. Worse, the NFA optimization phase moves arcs around with abandon. In a large NFA, some of the states will have hundreds of out-arcs, so towards the end of the optimization phase we have a significant number of states whose arc pools have room for hundreds of arcs each, even though only a few of those arcs are in use. We have seen real-world regexes in which this effect bloats the memory requirement by 25X or even more. Hence, get rid of the per-state arc pools in favor of a single arc pool for the whole NFA, with variable-sized allocation batches instead of always asking for 10 at a time. While we're at it, let's batch the allocations of state structs too, to further reduce the malloc traffic. This incidentally allows moveouts() to be optimized in a similar way to moveins(): when moving an arc to another state, it's now valid to just re-link the same arc struct into a different outchain, where before the code invariants required us to make a physically new arc and then free the old one. These changes reduce the regex compiler's typical space consumption for average-size regexes by about a factor of two, and much more for large or complicated regexes. In a large test set of real-world regexes, we formerly had half a dozen cases that failed with "regular expression too complex" due to exceeding the REG_MAX_COMPILE_SPACE limit (about 150MB); we would have had to raise that limit to something close to 400MB to make them work with the old code. Now, none of those cases need more than 13MB to compile. Furthermore, the test set is about 10% faster overall due to less malloc traffic. Discussion: https://postgr.es/m/168861.1614298592@sss.pgh.pa.us
* Allow complemented character class escapes within regex brackets.Tom Lane2021-02-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The complement-class escapes \D, \S, \W are now allowed within bracket expressions. There is no semantic difficulty with doing that, but the rather hokey macro-expansion-based implementation previously used here couldn't cope. Also, invent "word" as an allowed character class name, thus "\w" is now equivalent to "[[:word:]]" outside brackets, or "[:word:]" within brackets. POSIX allows such implementation-specific extensions, and the same name is used in e.g. bash. One surprising compatibility issue this raises is that constructs such as "[\w-_]" are now disallowed, as our documentation has always said they should be: character classes can't be endpoints of a range. Previously, because \w was just a macro for "[:alnum:]_", such a construct was read as "[[:alnum:]_-_]", so it was accepted so long as the character after "-" was numerically greater than or equal to "_". Some implementation cleanup along the way: * Remove the lexnest() hack, and in consequence clean up wordchrs() to not interact with the lexer. * Fix colorcomplement() to not be O(N^2) in the number of colors involved. * Get rid of useless-as-far-as-I-can-see calls of element() on single-character character element names in brackpart(). element() always maps these to the character itself, and things would be quite broken if it didn't --- should "[a]" match something different than "a" does? Besides, the shortcut path in brackpart() wasn't doing this anyway, making it even more inconsistent. Discussion: https://postgr.es/m/2845172.1613674385@sss.pgh.pa.us Discussion: https://postgr.es/m/3220564.1613859619@sss.pgh.pa.us
* Avoid generating extra subre tree nodes for capturing parentheses.Tom Lane2021-02-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, each pair of capturing parentheses gave rise to a separate subre tree node, whose only function was to identify that we ought to capture the match details for this particular sub-expression. In most cases we don't really need that, since we can perfectly well put a "capture this" annotation on the child node that does the real matching work. As with the two preceding commits, the main value of this is to avoid generating and optimizing an NFA for a tree node that's not really pulling its weight. The chosen data representation only allows one capture annotation per subre node. In the legal-per-spec, but seemingly not very useful, case where there are multiple capturing parens around the exact same bit of the regex (i.e. "((xyz))"), wrap the child node in N-1 capture nodes that act the same as before. We could work harder at that but I'll refrain, pending some evidence that such cases are worth troubling over. In passing, improve the comments in regex.h to say what all the different re_info bits mean. Some of them were pretty obvious but others not so much, so reverse-engineer some documentation. This is part of a patch series that in total reduces the regex engine's runtime by about a factor of four on a large corpus of real-world regexes. Patch by me, reviewed by Joel Jacobson Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
* Convert regex engine's subre tree from binary to N-ary style.Tom Lane2021-02-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of having left and right child links in subre structs, have a single child link plus a sibling link. Multiple children of a tree node are now reached by chasing the sibling chain. The beneficiary of this is alternation tree nodes. A regular expression with N (>1) branches is now represented by one alternation node with N children, rather than a tree that includes N alternation nodes as well as N children. While the old representation didn't really cost anything extra at execution time, it was pretty horrid for compilation purposes, because each of the alternation nodes had its own NFA, which we were too stupid not to separately optimize. (To make matters worse, all of those NFAs described the entire alternation pattern, not just the portion of it that one might expect from the tree structure.) We continue to require concatenation nodes to have exactly two children. This data structure is now prepared to support more, but the executor's logic would need some careful redesign, and it's not clear that a lot of benefit could be had. This is part of a patch series that in total reduces the regex engine's runtime by about a factor of four on a large corpus of real-world regexes. Patch by me, reviewed by Joel Jacobson Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
* Recognize "match-all" NFAs within the regex engine.Tom Lane2021-02-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This builds on the previous "rainbow" patch to detect NFAs that will match any string, though possibly with constraints on the string length. This definition is chosen to match constructs such as ".*", ".+", and ".{1,100}". Recognizing such an NFA after the optimization pass is fairly cheap, since we basically just have to verify that all arcs are RAINBOW arcs and count the number of steps to the end state. (Well, there's a bit of complication with pseudo-color arcs for string boundary conditions, but not much.) Once we have these markings, the regex executor functions longest(), shortest(), and matchuntil() don't have to expend per-character work to determine whether a given substring satisfies such an NFA; they just need to check its length against the bounds. Since some matching problems require O(N) invocations of these functions, we've reduced the runtime for an N-character string from O(N^2) to O(N). Of course, this is no help for non-matchall sub-patterns, but those usually have constraints that allow us to avoid needing O(N) substring checks in the first place. It's precisely the unconstrained "match-all" cases that cause the most headaches. This is part of a patch series that in total reduces the regex engine's runtime by about a factor of four on a large corpus of real-world regexes. Patch by me, reviewed by Joel Jacobson Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
* Invent "rainbow" arcs within the regex engine.Tom Lane2021-02-20
| | | | | | | | | | | | | | | | | | | | | | | Some regular expression constructs, most notably the "." match-anything metacharacter, produce a sheaf of parallel NFA arcs covering all possible colors (that is, character equivalence classes). We can make a noticeable improvement in the space and time needed to process large regexes by replacing such cases with a single arc bearing the special color code "RAINBOW". This requires only minor additional complication in places such as pull() and push(). Callers of pg_reg_getoutarcs() must now be prepared for the possibility of seeing a RAINBOW arc. For the one known user, contrib/pg_trgm, that's a net benefit since it cuts the number of arcs to be dealt with, and the handling isn't any different than for other colors that contain too many characters to be dealt with individually. This is part of a patch series that in total reduces the regex engine's runtime by about a factor of four on a large corpus of real-world regexes. Patch by me, reviewed by Joel Jacobson Discussion: https://postgr.es/m/1340281.1613018383@sss.pgh.pa.us
* Minor fixes to improve regex debugging code.Tom Lane2021-02-14
| | | | | | | | | | | | | | | When REG_DEBUG is defined, ensure that an un-filled "struct cnfa" is all-zeroes, not just that it has nstates == 0. This is mainly so that looking at "struct subre" structs in gdb doesn't distract one with a lot of garbage fields during regex compilation. Adjust some places that print debug output to have suitable fflush calls afterwards. In passing, correct an erroneous ancient comment: the concatenation subre-s created by parsebranch() have op == '.' not ','. Noted while fooling around with some regex performance improvements.
* Update copyright for 2021Bruce Momjian2021-01-02
| | | | Backpatch-through: 9.5
* Assume that we have <wchar.h>.Tom Lane2020-02-21
| | | | | | | | | | | | | Windows has this, and so do all other live platforms according to the buildfarm; it's been required by POSIX since SUSv2. So remove the configure probe and tests of HAVE_WCHAR_H. This is part of a series of commits to get rid of no-longer-relevant configure checks and dead src/port/ code. I'm committing them separately to make it easier to back out individual changes if they prove less portable than I expect. Discussion: https://postgr.es/m/15379.1582221614@sss.pgh.pa.us
* Update copyrights for 2020Bruce Momjian2020-01-01
| | | | Backpatch-through: update all files in master, backpatch legal files through 9.4
* Phase 2 pgindent run for v12.Tom Lane2019-05-22
| | | | | | | | | Switch to 2.1 version of pg_bsd_indent. This formats multiline function declarations "correctly", that is with additional lines of parameter declarations indented to match where the first line's left parenthesis is. Discussion: https://postgr.es/m/CAEepm=0P3FeTXRcU5B2W3jv3PgRVZ-kGUXLGfd42FFhUROO3ug@mail.gmail.com
* Partial implementation of SQL/JSON path languageAlexander Korotkov2019-03-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SQL 2016 standards among other things contains set of SQL/JSON features for JSON processing inside of relational database. The core of SQL/JSON is JSON path language, allowing access parts of JSON documents and make computations over them. This commit implements partial support JSON path language as separate datatype called "jsonpath". The implementation is partial because it's lacking datetime support and suppression of numeric errors. Missing features will be added later by separate commits. Support of SQL/JSON features requires implementation of separate nodes, and it will be considered in subsequent patches. This commit includes following set of plain functions, allowing to execute jsonpath over jsonb values: * jsonb_path_exists(jsonb, jsonpath[, jsonb, bool]), * jsonb_path_match(jsonb, jsonpath[, jsonb, bool]), * jsonb_path_query(jsonb, jsonpath[, jsonb, bool]), * jsonb_path_query_array(jsonb, jsonpath[, jsonb, bool]). * jsonb_path_query_first(jsonb, jsonpath[, jsonb, bool]). This commit also implements "jsonb @? jsonpath" and "jsonb @@ jsonpath", which are wrappers over jsonpath_exists(jsonb, jsonpath) and jsonpath_predicate(jsonb, jsonpath) correspondingly. These operators will have an index support (implemented in subsequent patches). Catversion bumped, to add new functions and operators. Code was written by Nikita Glukhov and Teodor Sigaev, revised by me. Documentation was written by Oleg Bartunov and Liudmila Mantrova. The work was inspired by Oleg Bartunov. Discussion: https://postgr.es/m/fcc6fc6a-b497-f39a-923d-aa34d0c588e8%402ndQuadrant.com Author: Nikita Glukhov, Teodor Sigaev, Alexander Korotkov, Oleg Bartunov, Liudmila Mantrova Reviewed-by: Tomas Vondra, Andrew Dunstan, Pavel Stehule, Alexander Korotkov
* Update copyright for 2019Bruce Momjian2019-01-02
| | | | Backpatch-through: certain files through 9.4
* Update copyright for 2018Bruce Momjian2018-01-02
| | | | Backpatch-through: certain files through 9.3
* Phase 2 of pgindent updates.Tom Lane2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change pg_bsd_indent to follow upstream rules for placement of comments to the right of code, and remove pgindent hack that caused comments following #endif to not obey the general rule. Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using the published version of pg_bsd_indent, but a hacked-up version that tried to minimize the amount of movement of comments to the right of code. The situation of interest is where such a comment has to be moved to the right of its default placement at column 33 because there's code there. BSD indent has always moved right in units of tab stops in such cases --- but in the previous incarnation, indent was working in 8-space tab stops, while now it knows we use 4-space tabs. So the net result is that in about half the cases, such comments are placed one tab stop left of before. This is better all around: it leaves more room on the line for comment text, and it means that in such cases the comment uniformly starts at the next 4-space tab stop after the code, rather than sometimes one and sometimes two tabs after. Also, ensure that comments following #endif are indented the same as comments following other preprocessor commands such as #else. That inconsistency turns out to have been self-inflicted damage from a poorly-thought-through post-indent "fixup" in pgindent. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
* Initial pgindent run with pg_bsd_indent version 2.0.Tom Lane2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new indent version includes numerous fixes thanks to Piotr Stefaniak. The main changes visible in this commit are: * Nicer formatting of function-pointer declarations. * No longer unexpectedly removes spaces in expressions using casts, sizeof, or offsetof. * No longer wants to add a space in "struct structname *varname", as well as some similar cases for const- or volatile-qualified pointers. * Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely. * Fixes bug where comments following declarations were sometimes placed with no space separating them from the code. * Fixes some odd decisions for comments following case labels. * Fixes some cases where comments following code were indented to less than the expected column 33. On the less good side, it now tends to put more whitespace around typedef names that are not listed in typedefs.list. This might encourage us to put more effort into typedef name collection; it's not really a bug in indent itself. There are more changes coming after this round, having to do with comment indentation and alignment of lines appearing within parentheses. I wanted to limit the size of the diffs to something that could be reviewed without one's eyes completely glazing over, so it seemed better to split up the changes as much as practical. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
* Document intentional violations of header inclusion policy.Tom Lane2017-03-08
| | | | | | | | | | | | | | | | | | Although there are good reasons for our policy of including postgres.h as the first #include in every .c file, never from .h files, there are two places where it seems expedient to violate the policy because the alternative is to modify externally-supplied .c files. (In the case of the regexp library, the idea that it's externally-supplied is kind of at odds with reality, but I haven't entirely given up hope that it will become a standalone project some day.) Add some comments to make it explicit that this is a policy violation and provide the reasoning. In passing, move #include "miscadmin.h" out of regcomp.c and into regcustom.h, which is where it should be if we're taking this reasoning seriously at all. Discussion: https://postgr.es/m/CAEepm=2zCoeq3QxVwhS5DFeUh=yU6z81pbWMgfOB8OzyiBwxzw@mail.gmail.com Discussion: https://postgr.es/m/11634.1488932128@sss.pgh.pa.us
* Update copyright via script for 2017Bruce Momjian2017-01-03
|
* Make locale-dependent regex character classes work for large char codes.Tom Lane2016-09-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, we failed to recognize Unicode characters above U+7FF as being members of locale-dependent character classes such as [[:alpha:]]. (Actually, the same problem occurs for large pg_wchar values in any multibyte encoding, but UTF8 is the only case people have actually complained about.) It's impractical to get Spencer's original code to handle character classes or ranges containing many thousands of characters, because it insists on considering each member character individually at regex compile time, whether or not the character will ever be of interest at run time. To fix, choose a cutoff point MAX_SIMPLE_CHR below which we process characters individually as before, and deal with entire ranges or classes as single entities above that. We can actually make things cheaper than before for chars below the cutoff, because the color map can now be a simple linear array for those chars, rather than the multilevel tree structure Spencer designed. It's more expensive than before for chars above the cutoff, because we must do a binary search in a list of high chars and char ranges used in the regex pattern, plus call iswalpha() and friends for each locale-dependent character class used in the pattern. However, multibyte encodings are normally designed to give smaller codes to popular characters, so that we can expect that the slow path will be taken relatively infrequently. In any case, the speed penalty appears minor except when we have to apply iswalpha() etc. to high character codes at runtime --- and the previous coding gave wrong answers for those cases, so whether it was faster is moot. Tom Lane, reviewed by Heikki Linnakangas Discussion: <15563.1471913698@sss.pgh.pa.us>
* Clean up another pre-ANSI-C-ism in regex code: get rid of pcolor typedef.Tom Lane2016-08-19
| | | | | | pcolor was used to represent function arguments that are nominally of type color, but when using a pre-ANSI C compiler would be passed as the promoted integer type. We really don't need that anymore.
* Remove typedef celt from the regex library, along with macro NOCELT.Tom Lane2016-08-19
| | | | | | | | | | | | | | | | | | | | | The regex library used to have a notion of a "collating element" that was distinct from a "character", but Henry Spencer never actually implemented his planned support for multi-character collating elements, and the Tcl crew ripped out most of the stubs for that years ago. The only thing left that distinguished the "celt" typedef from the "chr" typedef was that "celt" was supposed to also be able to hold the not-a-character "NOCELT" value. However, NOCELT was not used anywhere after the MCCE stub removal changes, which means there's no need for celt to be different from chr. Removing the separate typedef simplifies matters and also removes a trap for the unwary, in that celt is signed while chr may not be, so comparisons could mean different things. There's no bug there today because we restrict CHR_MAX to be less than INT_MAX, but I think there may have been such bugs before we did that, and there could be again if anyone ever decides to fool with the range of chr. This patch also removes assorted unnecessary casts to "chr" of values that are already chrs. Many of these seem to be leftover from days when the code was compatible with pre-ANSI C.
* Suppress compiler warnings about useless comparison of unsigned to zero.Tom Lane2016-02-15
| | | | | | | | | | | | Reportedly, some compilers warn about tests like "c < 0" if c is unsigned, and hence complain about the character range checks I added in commit 3bb3f42f3749d40b8d4de65871e8d828b18d4a45. This is a bit of a pain since the regex library doesn't really want to assume that chr is unsigned. However, since any such reconfiguration would involve manual edits of regcustom.h anyway, we can put it on the shoulders of whoever wants to do that to adjust this new range-checking macro correctly. Per gripes from Coverity and Andres.
* Fix some regex issues with out-of-range characters and large char ranges.Tom Lane2016-02-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, our regex code defined CHR_MAX as 0xfffffffe, which is a bad choice because it is outside the range of type "celt" (int32). Characters approaching that limit could lead to infinite loops in logic such as "for (c = a; c <= b; c++)" where c is of type celt but the range bounds are chr. Such loops will work safely only if CHR_MAX+1 is representable in celt, since c must advance to beyond b before the loop will exit. Fortunately, there seems no reason not to restrict CHR_MAX to 0x7ffffffe. It's highly unlikely that Unicode will ever assign codes that high, and none of our other backend encodings need characters beyond that either. In addition to modifying the macro, we have to explicitly enforce character range restrictions on the values of \u, \U, and \x escape sequences, else the limit is trivially bypassed. Also, the code for expanding case-independent character ranges in bracket expressions had a potential integer overflow in its calculation of the number of characters it could generate, which could lead to allocating too small a character vector and then overwriting memory. An attacker with the ability to supply arbitrary regex patterns could easily cause transient DOS via server crashes, and the possibility for privilege escalation has not been ruled out. Quite aside from the integer-overflow problem, the range expansion code was unnecessarily inefficient in that it always produced a result consisting of individual characters, abandoning the knowledge that we had a range to start with. If the input range is large, this requires excessive memory. Change it so that the original range is reported as-is, and then we add on any case-equivalent characters that are outside that range. With this approach, we can bound the number of individual characters allowed without sacrificing much. This patch allows at most 100000 individual characters, which I believe to be more than the number of case pairs existing in Unicode, so that the restriction will never be hit in practice. It's still possible for range() to take awhile given a large character code range, so also add statement-cancel detection to its loop. The downstream function dovec() also lacked cancel detection, and could take a long time given a large output from range(). Per fuzz testing by Greg Stark. Back-patch to all supported branches. Security: CVE-2016-0773
* Update copyright for 2016Bruce Momjian2016-01-02
| | | | Backpatch certain files through 9.1
* Implement lookbehind constraints in our regular-expression engine.Tom Lane2015-10-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A lookbehind constraint is like a lookahead constraint in that it consumes no text; but it checks for existence (or nonexistence) of a match *ending* at the current point in the string, rather than one *starting* at the current point. This is a long-requested feature since it exists in many other regex libraries, but Henry Spencer had never got around to implementing it in the code we use. Just making it work is actually pretty trivial; but naive copying of the logic for lookahead constraints leads to code that often spends O(N^2) time to scan an N-character string, because we have to run the match engine from string start to the current probe point each time the constraint is checked. In typical use-cases a lookbehind constraint will be written at the start of the regex and hence will need to be checked at every character --- so O(N^2) work overall. To fix that, I introduced a third copy of the core DFA matching loop, paralleling the existing longest() and shortest() loops. This version, matchuntil(), can suspend and resume matching given a couple of pointers' worth of storage space. So we need only run it across the string once, stopping at each interesting probe point and then resuming to advance to the next one. I also put in an optimization that simplifies one-character lookahead and lookbehind constraints, such as "(?=x)" or "(?<!\w)", into AHEAD and BEHIND constraints, which already existed in the engine. This avoids the overhead of the LACON machinery entirely for these rather common cases. The net result is that lookbehind constraints run a factor of three or so slower than Perl's for multi-character constraints, but faster than Perl's for one-character constraints ... and they work fine for variable-length constraints, which Perl gives up on entirely. So that's not bad from a competitive perspective, and there's room for further optimization if anyone cares. (In reality, raw scan rate across a large input string is probably not that big a deal for Postgres usage anyway; so I'm happy if it's linear.)
* Improve memory-usage accounting in regular-expression compiler.Tom Lane2015-10-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This code previously counted the number of NFA states it created, and complained if a limit was exceeded, so as to prevent bizarre regex patterns from consuming unreasonable time or memory. That's fine as far as it went, but the code paid no attention to how many arcs linked those states. Since regexes can be contrived that have O(N) states but will need O(N^2) arcs after fixempties() processing, it was still possible to blow out memory, and take a long time doing it too. To fix, modify the bookkeeping to count space used by both states and arcs. I did not bother with including the "color map" in the accounting; it can only grow to a few megabytes, which is not a lot in comparison to what we're allowing for states+arcs (about 150MB on 64-bit machines or half that on 32-bit machines). Looking at some of the larger real-world regexes captured in the Tcl regression test suite suggests that the most that is likely to be needed for regexes found in the wild is under 10MB, so I believe that the current limit has enough headroom to make it okay to keep it as a hard-wired limit. In connection with this, redefine REG_ETOOBIG as meaning "regular expression is too complex"; the previous wording of "nfa has too many states" was already somewhat inapropos because of the error code's use for stack depth overrun, and it was not very user-friendly either. Back-patch to all supported branches.
* Fix O(N^2) performance problems in regular-expression compiler.Tom Lane2015-10-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the singly-linked in-arc and out-arc lists to be doubly-linked, so that arc deletion is constant time rather than having worst-case time proportional to the number of other arcs on the connected states. Modify the bulk arc transfer operations copyins(), copyouts(), moveins(), moveouts() so that they use a sort-and-merge algorithm whenever there's more than a small number of arcs to be copied or moved. The previous method is O(N^2) in the number of arcs involved, because it performs duplicate checking independently for each copied arc. The new method may change the ordering of existing arcs for the destination state, but nothing really cares about that. Provide another bulk arc copying method mergeins(), which is unused as of this commit but is needed for the next one. It basically is like copyins(), but the source arcs might not all come from the same state. Replace the O(N^2) bubble-sort algorithm used in carcsort() with a qsort() call. These changes greatly improve the performance of regex compilation for large or complex regexes, at the cost of extra space for arc storage during compilation. The original tradeoff was probably fine when it was made, but now we care more about speed and less about memory consumption. Back-patch to all supported branches.
* Add recursion depth protections to regular expression matching.Tom Lane2015-10-02
| | | | | | | | | | | | | | | Some of the functions in regex compilation and execution recurse, and therefore could in principle be driven to stack overflow. The Tcl crew has seen this happen in practice in duptraverse(), though their fix was to put in a hard-wired limit on the number of recursive levels, which is not too appetizing --- fortunately, we have enough infrastructure to check the actually available stack. Greg Stark has also seen it in other places while fuzz testing on a machine with limited stack space. Let's put guards in to prevent crashes in all these places. Since the regex code would leak memory if we simply threw elog(ERROR), we have to introduce an API that checks for stack depth without throwing such an error. Fortunately that's not difficult.
* Fix low-probability memory leak in regex execution.Tom Lane2015-09-18
| | | | | | | | | | | | | | | | After an internal failure in shortest() or longest() while pinning down the exact location of a match, find() forgot to free the DFA structure before returning. This is pretty unlikely to occur, since we just successfully ran the "search" variant of the DFA; but it could happen, and it would result in a session-lifespan memory leak since this code uses malloc() directly. Problem seems to have been aboriginal in Spencer's library, so back-patch all the way. In passing, correct a thinko in a comment I added awhile back about the meaning of the "ntree" field. I happened across these issues while comparing our code to Tcl's version of the library.
* Sync regex code with Tcl 8.6.4.Tom Lane2015-09-16
| | | | | | | | | | | | | | | | | | | Sync our regex code with upstream changes since last time we did this, which was Tcl 8.5.11 (see commit 08fd6ff37f71485e2fc04bc6ce07d2a483c36702). The only functional change here is to disbelieve that an octal escape is three digits long if it would exceed \377. That's a bug fix, but it's a minor one and could change the interpretation of working regexes, so don't back-patch. In addition to that, s/INFINITY/DUPINF/ to eliminate the risk of collisions with <math.h>'s macro, and s/LOCAL/NOPROP/ because that also seems like an unnecessarily collision-prone macro name. There were some other cosmetic changes in their copy that I did not adopt, notably a rather half-hearted attempt at renaming some of the C functions in a more verbose style. (I'm not necessarily against the concept, but renaming just a few functions in the package is not an improvement.)
* Add a bit more commentary about regex's colormap tree data structure.Tom Lane2015-05-24
| | | | Per an off-list question from Piotr Stefaniak.
* Update copyright for 2015Bruce Momjian2015-01-06
| | | | Backpatch certain files through 9.0
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Allow regex operations to be terminated early by query cancel requests.Tom Lane2014-03-01
| | | | | | | | | | | | | | | | | | | | | | | | | The regex code didn't have any provision for query cancel; which is unsurprising given its non-Postgres origin, but still problematic since some operations can take a long time. Introduce a callback function to check for a pending query cancel or session termination request, and call it in a couple of strategic spots where we can make the regex code exit with an error indicator. If we ever actually split out the regex code as a standalone library, some additional work will be needed to let the cancel callback function be specified externally to the library. But that's straightforward (certainly so by comparison to putting the locale-dependent character classification logic on a similar arms-length basis), and there seems no need to do it right now. A bigger issue is that there may be more places than these two where we need to check for cancels. We can always add more checks later, now that the infrastructure is in place. Since there are known examples of not-terribly-long regexes that can lock up a backend for a long time, back-patch to all supported branches. I have hopes of fixing the known performance problems later, but adding query cancel ability seems like a good idea even if they were all fixed.
* Update copyright for 2014Bruce Momjian2014-01-07
| | | | | Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.
* Fix old typo in comment.Tom Lane2013-10-29
| | | | NFAs have children, but their individual states don't.
* Support indexing of regular-expression searches in contrib/pg_trgm.Tom Lane2013-04-09
| | | | | | | | | | | | | | | | This works by extracting trigrams from the given regular expression, in generally the same spirit as the previously-existing support for LIKE searches, though of course the details are far more complicated. Currently, only GIN indexes are supported. We might be able to make it work with GiST indexes later. The implementation includes adding API functions to backend/regex/ to provide a view of the search NFA created from a regular expression. These functions are meant to be generic enough to be supportable in a standalone version of the regex library, should that ever happen. Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
* Fix crash on compiling a regular expression with more than 32k colors.Heikki Linnakangas2013-04-04
| | | | | | Throw an error instead. Backpatch to all supported branches.