Re-implement extraction of fixed prefixes from regular expressions.

To generate btree-indexable conditions from regex WHERE conditions (such as WHERE indexed_col ~ '^foo'), we need to be able to identify any fixed prefix that a regex might have; that is, find any string that must be a prefix of all strings satisfying the regex. We used to do that with entirely ad-hoc code that looked at the source text of the regex. It didn't know very much about regex syntax, which mostly meant that it would fail to identify some optimizable cases; but Viktor Rosenfeld reported that it would produce actively wrong answers for quantified parenthesized subexpressions, such as '^(foo)?bar'. Rather than trying to extend the ad-hoc code to cover this, let's get rid of it altogether in favor of identifying prefixes by examining the compiled form of a regex.

To do this, I've added a new entry point "pg_regprefix" to the regex library; hopefully it is defined in a sufficiently general fashion that it can remain in the library when/if that code gets split out as a standalone project.

Since this bug has been there for a very long time, this fix needs to get back-patched. However it depends on some other recent commits (particularly the addition of wchar-to-database-encoding conversion), so I'll commit this separately and then go to work on back-porting the necessary fixes.
author: Tom Lane <tgl@sss.pgh.pa.us> 2012-07-10 14:54:37 -0400
committer: Tom Lane <tgl@sss.pgh.pa.us> 2012-07-10 14:54:37 -0400
commit: 628cbb50ba80c83917b07a7609ddec12cda172d0 (patch)
tree: 7008492921c90e6de7c431633e33624a597a8416 /src/backend/utils/adt/regexp.c
parent: 00dac6000d422033c3e8d191f01ee0e6525794c2 (diff)
download: postgresql-628cbb50ba80c83917b07a7609ddec12cda172d0.tar.gz
postgresql-628cbb50ba80c83917b07a7609ddec12cda172d0.zip
1 files changed, 65 insertions, 0 deletions
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 96c77078c8b..074142e7985 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -1170,3 +1170,68 @@ build_regexp_split_result(regexp_matches_ctx *splitctx)
 								   Int32GetDatum(startpos + 1));
 	}
 }
+
+/*
+ * regexp_fixed_prefix - extract fixed prefix, if any, for a regexp
+ *
+ * The result is NULL if there is no fixed prefix, else a palloc'd string.
+ * If it is an exact match, not just a prefix, *exact is returned as TRUE.
+ */
+char *
+regexp_fixed_prefix(text *text_re, bool case_insensitive, Oid collation,
+					bool *exact)
+{
+	char	   *result;
+	regex_t    *re;
+	int			cflags;
+	int			re_result;
+	pg_wchar   *str;
+	size_t		slen;
+	size_t		maxlen;
+	char		errMsg[100];
+
+	*exact = false;				/* default result */
+
+	/* Compile RE */
+	cflags = REG_ADVANCED;
+	if (case_insensitive)
+		cflags |= REG_ICASE;
+
+	re = RE_compile_and_cache(text_re, cflags, collation);
+
+	/* Examine it to see if there's a fixed prefix */
+	re_result = pg_regprefix(re, &str, &slen);
+
+	switch (re_result)
+	{
+		case REG_NOMATCH:
+			return NULL;
+
+		case REG_PREFIX:
+			/* continue with wchar conversion */
+			break;
+
+		case REG_EXACT:
+			*exact = true;
+			/* continue with wchar conversion */
+			break;
+
+		default:
+			/* re failed??? */
+			pg_regerror(re_result, re, errMsg, sizeof(errMsg));
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
+					 errmsg("regular expression failed: %s", errMsg)));
+			break;
+	}
+
+	/* Convert pg_wchar result back to database encoding */
+	maxlen = pg_database_encoding_max_length() * slen + 1;
+	result = (char *) palloc(maxlen);
+	slen = pg_wchar2mb_with_len(str, result, slen);
+	Assert(slen < maxlen);
+
+	free(str);
+
+	return result;
+}