diff options
Diffstat (limited to 'src/backend/regex')
-rw-r--r-- | src/backend/regex/README | 96 | ||||
-rw-r--r-- | src/backend/regex/regc_color.c | 885 | ||||
-rw-r--r-- | src/backend/regex/regc_cvec.c | 4 | ||||
-rw-r--r-- | src/backend/regex/regc_locale.c | 89 | ||||
-rw-r--r-- | src/backend/regex/regc_pg_locale.c | 36 | ||||
-rw-r--r-- | src/backend/regex/regcomp.c | 62 | ||||
-rw-r--r-- | src/backend/regex/regexport.c | 86 | ||||
-rw-r--r-- | src/backend/regex/regprefix.c | 5 |
8 files changed, 820 insertions, 443 deletions
diff --git a/src/backend/regex/README b/src/backend/regex/README index 6c9f48315e3..f08aab69e37 100644 --- a/src/backend/regex/README +++ b/src/backend/regex/README @@ -27,13 +27,14 @@ and similarly additional source files rege_*.c that are #include'd in regexec. This was done to avoid exposing internal symbols globally; all functions not meant to be part of the library API are static. -(Actually the above is a lie in one respect: there is one more global -symbol, pg_set_regex_collation in regcomp. It is not meant to be part of -the API, but it has to be global because both regcomp and regexec call it. -It'd be better to get rid of that, as well as the static variables it -sets, in favor of keeping the needed locale state in the regex structs. -We have not done this yet for lack of a design for how to add -application-specific state to the structs.) +(Actually the above is a lie in one respect: there are two more global +symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are +not meant to be part of the API, but they have to be global because both +regcomp and regexec call them. It'd be better to get rid of +pg_set_regex_collation, as well as the static variables it sets, in favor of +keeping the needed locale state in the regex structs. We have not done this +yet for lack of a design for how to add application-specific state to the +structs.) What's where in src/backend/regex/: @@ -274,28 +275,65 @@ colors: an existing color has to be subdivided. The last two of these are handled with the "struct colordesc" array and -the "colorchain" links in NFA arc structs. The color map proper (that -is, the per-character lookup array) is handled as a multi-level tree, -with each tree level indexed by one byte of a character's value. The -code arranges to not have more than one copy of bottom-level tree pages -that are all-the-same-color. - -Unfortunately, this design does not seem terribly efficient for common -cases such as a tree in which all Unicode letters are colored the same, -because there aren't that many places where we get a whole page all the -same color, except at the end of the map. (It also strikes me that given -PG's current restrictions on the range of Unicode values, we could use a -3-level rather than 4-level tree; but there's not provision for that in -regguts.h at the moment.) - -A bigger problem is that it just doesn't seem very reasonable to have to -consider each Unicode letter separately at regex parse time for a regex -such as "\w"; more than likely, a huge percentage of those codes will -never be seen at runtime. We need to fix things so that locale-based -character classes are somehow processed "symbolically" without making a -full expansion of their contents at parse time. This would mean that we'd -have to be ready to call iswalpha() at runtime, but if that only happens -for high-code-value characters, it shouldn't be a big performance hit. +the "colorchain" links in NFA arc structs. + +Ideally, we'd do the first two operations using a simple linear array +storing the current color assignment for each character code. +Unfortunately, that's not terribly workable for large charsets such as +Unicode. Our solution is to divide the color map into two parts. A simple +linear array is used for character codes up to MAX_SIMPLE_CHR, which can be +chosen large enough to include all popular characters (so that the +significantly-slower code paths about to be described are seldom invoked). +Characters above that need be considered at compile time only if they +appear explicitly in the regex pattern. We store each such mentioned +character or character range as an entry in the "colormaprange" array in +the colormap. (Overlapping ranges are split into unique subranges, so that +each range in the finished list needs only a single color that describes +all its characters.) When mapping a character above MAX_SIMPLE_CHR to a +color at runtime, we search this list of ranges explicitly. + +That's still not quite enough, though, because of locale-dependent +character classes such as [[:alpha:]]. In Unicode locales these classes +may have thousands of entries that are above MAX_SIMPLE_CHR, and we +certainly don't want to be searching large colormaprange arrays at runtime. +Nor do we even want to spend the time to initialize cvec structures that +exhaustively describe all of those characters. Our solution is to compute +exact per-character colors at regex compile time only up to MAX_SIMPLE_CHR. +For characters above that, we apply the <ctype.h> or <wctype.h> lookup +functions at runtime for each locale-dependent character class used in the +regex pattern, constructing a bitmap that describes which classes the +runtime character belongs to. The per-character-range data structure +mentioned above actually holds, for each range, a separate color entry +for each possible combination of character class properties. That is, +the color map for characters above MAX_SIMPLE_CHR is really a 2-D array, +whose rows correspond to high characters or character ranges that are +explicitly mentioned in the regex pattern, and whose columns correspond +to sets of the locale-dependent character classes that are used in the +regex. + +As an example, given the pattern '\w\u1234[\U0001D100-\U0001D1FF]' +(and supposing that MAX_SIMPLE_CHR is less than 0x1234), we will need +a high color map with three rows. One row is for the single character +U+1234 (represented as a single-element range), one is for the range +U+1D100..U+1D1FF, and the other row represents all remaining high +characters. The color map has two columns, one for characters that +satisfy iswalnum() and one for those that don't. + +We build this color map in parallel with scanning the regex. Each time +we detect a new explicit high character (or range) or a locale-dependent +character class, we split existing entry(s) in the high color map so that +characters we need to be able to distinguish will have distinct entries +that can be given separate colors. Often, though, single entries in the +high color map will represent very large sets of characters. + +If there are both explicit high characters/ranges and locale-dependent +character classes, we may have entries in the high color map array that +have non-WHITE colors but don't actually represent any real characters. +(For example, in a row representing a singleton range, only one of the +columns could possibly be a live entry; it's the one matching the actual +locale properties for that single character.) We don't currently make +any effort to reclaim such colors. In principle it could be done, but +it's not clear that it's worth the trouble. Detailed semantics of an NFA diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c index 8ffc8fb797d..4b72894ad3a 100644 --- a/src/backend/regex/regc_color.c +++ b/src/backend/regex/regc_color.c @@ -49,10 +49,6 @@ static void initcm(struct vars * v, struct colormap * cm) { - int i; - int j; - union tree *t; - union tree *nextt; struct colordesc *cd; cm->magic = CMMAGIC; @@ -64,24 +60,40 @@ initcm(struct vars * v, cm->free = 0; cd = cm->cd; /* cm->cd[WHITE] */ + cd->nschrs = MAX_SIMPLE_CHR - CHR_MIN + 1; + cd->nuchrs = 1; cd->sub = NOSUB; cd->arcs = NULL; cd->firstchr = CHR_MIN; - cd->nchrs = CHR_MAX - CHR_MIN + 1; cd->flags = 0; - /* upper levels of tree */ - for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--) + cm->locolormap = (color *) + MALLOC((MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color)); + if (cm->locolormap == NULL) { - nextt = t + 1; - for (i = BYTTAB - 1; i >= 0; i--) - t->tptr[i] = nextt; + CERR(REG_ESPACE); + cm->cmranges = NULL; /* prevent failure during freecm */ + cm->hicolormap = NULL; + return; } - /* bottom level is solid white */ - t = &cm->tree[NBYTS - 1]; - for (i = BYTTAB - 1; i >= 0; i--) - t->tcolor[i] = WHITE; - cd->block = t; + /* this memset relies on WHITE being zero: */ + memset(cm->locolormap, WHITE, + (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color)); + + memset(cm->classbits, 0, sizeof(cm->classbits)); + cm->numcmranges = 0; + cm->cmranges = NULL; + cm->maxarrayrows = 4; /* arbitrary initial allocation */ + cm->hiarrayrows = 1; /* but we have only one row/col initially */ + cm->hiarraycols = 1; + cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color)); + if (cm->hicolormap == NULL) + { + CERR(REG_ESPACE); + return; + } + /* initialize the "all other characters" row to WHITE */ + cm->hicolormap[0] = WHITE; } /* @@ -90,117 +102,67 @@ initcm(struct vars * v, static void freecm(struct colormap * cm) { - size_t i; - union tree *cb; - cm->magic = 0; - if (NBYTS > 1) - cmtreefree(cm, cm->tree, 0); - for (i = 1; i <= cm->max; i++) /* skip WHITE */ - if (!UNUSEDCOLOR(&cm->cd[i])) - { - cb = cm->cd[i].block; - if (cb != NULL) - FREE(cb); - } if (cm->cd != cm->cdspace) FREE(cm->cd); + if (cm->locolormap != NULL) + FREE(cm->locolormap); + if (cm->cmranges != NULL) + FREE(cm->cmranges); + if (cm->hicolormap != NULL) + FREE(cm->hicolormap); } /* - * cmtreefree - free a non-terminal part of a colormap tree + * pg_reg_getcolor - slow case of GETCOLOR() */ -static void -cmtreefree(struct colormap * cm, - union tree * tree, - int level) /* level number (top == 0) of this block */ +color +pg_reg_getcolor(struct colormap * cm, chr c) { - int i; - union tree *t; - union tree *fillt = &cm->tree[level + 1]; - union tree *cb; - - assert(level < NBYTS - 1); /* this level has pointers */ - for (i = BYTTAB - 1; i >= 0; i--) + int rownum, + colnum, + low, + high; + + /* Should not be used for chrs in the locolormap */ + assert(c > MAX_SIMPLE_CHR); + + /* + * Find which row it's in. The colormapranges are in order, so we can use + * binary search. + */ + rownum = 0; /* if no match, use array row zero */ + low = 0; + high = cm->numcmranges; + while (low < high) { - t = tree->tptr[i]; - assert(t != NULL); - if (t != fillt) + int middle = low + (high - low) / 2; + const colormaprange *cmr = &cm->cmranges[middle]; + + if (c < cmr->cmin) + high = middle; + else if (c > cmr->cmax) + low = middle + 1; + else { - if (level < NBYTS - 2) - { /* more pointer blocks below */ - cmtreefree(cm, t, level + 1); - FREE(t); - } - else - { /* color block below */ - cb = cm->cd[t->tcolor[0]].block; - if (t != cb) /* not a solid block */ - FREE(t); - } + rownum = cmr->rownum; /* found a match */ + break; } } -} -/* - * setcolor - set the color of a character in a colormap - */ -static color /* previous color */ -setcolor(struct colormap * cm, - chr c, - color co) -{ - uchr uc = c; - int shift; - int level; - int b; - int bottom; - union tree *t; - union tree *newt; - union tree *fillt; - union tree *lastt; - union tree *cb; - color prev; - - assert(cm->magic == CMMAGIC); - if (CISERR() || co == COLORLESS) - return COLORLESS; - - t = cm->tree; - for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0; - level++, shift -= BYTBITS) - { - b = (uc >> shift) & BYTMASK; - lastt = t; - t = lastt->tptr[b]; - assert(t != NULL); - fillt = &cm->tree[level + 1]; - bottom = (shift <= BYTBITS) ? 1 : 0; - cb = (bottom) ? cm->cd[t->tcolor[0]].block : fillt; - if (t == fillt || t == cb) - { /* must allocate a new block */ - newt = (union tree *) MALLOC((bottom) ? - sizeof(struct colors) : sizeof(struct ptrs)); - if (newt == NULL) - { - CERR(REG_ESPACE); - return COLORLESS; - } - if (bottom) - memcpy(VS(newt->tcolor), VS(t->tcolor), - BYTTAB * sizeof(color)); - else - memcpy(VS(newt->tptr), VS(t->tptr), - BYTTAB * sizeof(union tree *)); - t = newt; - lastt->tptr[b] = t; - } + /* + * Find which column it's in --- this is all locale-dependent. + */ + if (cm->hiarraycols > 1) + { + colnum = cclass_column_index(cm, c); + return cm->hicolormap[rownum * cm->hiarraycols + colnum]; + } + else + { + /* fast path if no relevant cclasses */ + return cm->hicolormap[rownum]; } - - b = uc & BYTMASK; - prev = t->tcolor[b]; - t->tcolor[b] = co; - return prev; } /* @@ -216,7 +178,7 @@ maxcolor(struct colormap * cm) } /* - * newcolor - find a new color (must be subject of setcolor at once) + * newcolor - find a new color (must be assigned at once) * Beware: may relocate the colordescs. */ static color /* COLORLESS for error */ @@ -278,12 +240,12 @@ newcolor(struct colormap * cm) cd = &cm->cd[cm->max]; } - cd->nchrs = 0; + cd->nschrs = 0; + cd->nuchrs = 0; cd->sub = NOSUB; cd->arcs = NULL; cd->firstchr = CHR_MIN; /* in case never set otherwise */ cd->flags = 0; - cd->block = NULL; return (color) (cd - cm->cd); } @@ -305,13 +267,9 @@ freecolor(struct colormap * cm, assert(cd->arcs == NULL); assert(cd->sub == NOSUB); - assert(cd->nchrs == 0); + assert(cd->nschrs == 0); + assert(cd->nuchrs == 0); cd->flags = FREECOL; - if (cd->block != NULL) - { - FREE(cd->block); - cd->block = NULL; /* just paranoia */ - } if ((size_t) co == cm->max) { @@ -354,17 +312,25 @@ static color pseudocolor(struct colormap * cm) { color co; + struct colordesc *cd; co = newcolor(cm); if (CISERR()) return COLORLESS; - cm->cd[co].nchrs = 1; - cm->cd[co].flags = PSEUDO; + cd = &cm->cd[co]; + cd->nschrs = 0; + cd->nuchrs = 1; /* pretend it is in the upper map */ + cd->sub = NOSUB; + cd->arcs = NULL; + cd->firstchr = CHR_MIN; + cd->flags = PSEUDO; return co; } /* * subcolor - allocate a new subcolor (if necessary) to this chr + * + * This works only for chrs that map into the low color map. */ static color subcolor(struct colormap * cm, chr c) @@ -372,7 +338,9 @@ subcolor(struct colormap * cm, chr c) color co; /* current color of c */ color sco; /* new subcolor */ - co = GETCOLOR(cm, c); + assert(c <= MAX_SIMPLE_CHR); + + co = cm->locolormap[c - CHR_MIN]; sco = newsub(cm, co); if (CISERR()) return COLORLESS; @@ -380,11 +348,37 @@ subcolor(struct colormap * cm, chr c) if (co == sco) /* already in an open subcolor */ return co; /* rest is redundant */ - cm->cd[co].nchrs--; - if (cm->cd[sco].nchrs == 0) + cm->cd[co].nschrs--; + if (cm->cd[sco].nschrs == 0) cm->cd[sco].firstchr = c; - cm->cd[sco].nchrs++; - setcolor(cm, c, sco); + cm->cd[sco].nschrs++; + cm->locolormap[c - CHR_MIN] = sco; + return sco; +} + +/* + * subcolorhi - allocate a new subcolor (if necessary) to this colormap entry + * + * This is the same processing as subcolor(), but for entries in the high + * colormap, which do not necessarily correspond to exactly one chr code. + */ +static color +subcolorhi(struct colormap * cm, color *pco) +{ + color co; /* current color of entry */ + color sco; /* new subcolor */ + + co = *pco; + sco = newsub(cm, co); + if (CISERR()) + return COLORLESS; + assert(sco != COLORLESS); + + if (co == sco) /* already in an open subcolor */ + return co; /* rest is redundant */ + cm->cd[co].nuchrs--; + cm->cd[sco].nuchrs++; + *pco = sco; return sco; } @@ -400,7 +394,8 @@ newsub(struct colormap * cm, sco = cm->cd[co].sub; if (sco == NOSUB) { /* color has no open subcolor */ - if (cm->cd[co].nchrs == 1) /* optimization */ + /* optimization: singly-referenced color need not be subcolored */ + if ((cm->cd[co].nschrs + cm->cd[co].nuchrs) == 1) return co; sco = newcolor(cm); /* must create subcolor */ if (sco == COLORLESS) @@ -417,136 +412,500 @@ newsub(struct colormap * cm, } /* - * subrange - allocate new subcolors to this range of chrs, fill in arcs + * newhicolorrow - get a new row in the hicolormap, cloning it from oldrow + * + * Returns array index of new row. Note the array might move. */ -static void -subrange(struct vars * v, - chr from, - chr to, - struct state * lp, - struct state * rp) +static int +newhicolorrow(struct colormap * cm, + int oldrow) { - uchr uf; + int newrow = cm->hiarrayrows; + color *newrowptr; int i; - assert(from <= to); + /* Assign a fresh array row index, enlarging storage if needed */ + if (newrow >= cm->maxarrayrows) + { + color *newarray; + + if (cm->maxarrayrows >= INT_MAX / (cm->hiarraycols * 2)) + { + CERR(REG_ESPACE); + return 0; + } + newarray = (color *) REALLOC(cm->hicolormap, + cm->maxarrayrows * 2 * + cm->hiarraycols * sizeof(color)); + if (newarray == NULL) + { + CERR(REG_ESPACE); + return 0; + } + cm->hicolormap = newarray; + cm->maxarrayrows *= 2; + } + cm->hiarrayrows++; + + /* Copy old row data */ + newrowptr = &cm->hicolormap[newrow * cm->hiarraycols]; + memcpy(newrowptr, + &cm->hicolormap[oldrow * cm->hiarraycols], + cm->hiarraycols * sizeof(color)); + + /* Increase color reference counts to reflect new colormap entries */ + for (i = 0; i < cm->hiarraycols; i++) + cm->cd[newrowptr[i]].nuchrs++; - /* first, align "from" on a tree-block boundary */ - uf = (uchr) from; - i = (int) (((uf + BYTTAB - 1) & (uchr) ~BYTMASK) - uf); - for (; from <= to && i > 0; i--, from++) - newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp); - if (from > to) /* didn't reach a boundary */ + return newrow; +} + +/* + * newhicolorcols - create a new set of columns in the high colormap + * + * Essentially, extends the 2-D array to the right with a copy of itself. + */ +static void +newhicolorcols(struct colormap * cm) +{ + color *newarray; + int r, + c; + + if (cm->hiarraycols >= INT_MAX / (cm->maxarrayrows * 2)) + { + CERR(REG_ESPACE); return; + } + newarray = (color *) REALLOC(cm->hicolormap, + cm->maxarrayrows * + cm->hiarraycols * 2 * sizeof(color)); + if (newarray == NULL) + { + CERR(REG_ESPACE); + return; + } + cm->hicolormap = newarray; - /* deal with whole blocks */ - for (; to - from >= BYTTAB; from += BYTTAB) - subblock(v, from, lp, rp); + /* Duplicate existing columns to the right, and increase ref counts */ + /* Must work backwards in the array because we realloc'd in place */ + for (r = cm->hiarrayrows - 1; r >= 0; r--) + { + color *oldrowptr = &newarray[r * cm->hiarraycols]; + color *newrowptr = &newarray[r * cm->hiarraycols * 2]; + color *newrowptr2 = newrowptr + cm->hiarraycols; + + for (c = 0; c < cm->hiarraycols; c++) + { + color co = oldrowptr[c]; - /* clean up any remaining partial table */ - for (; from <= to; from++) - newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp); + newrowptr[c] = newrowptr2[c] = co; + cm->cd[co].nuchrs++; + } + } + + cm->hiarraycols *= 2; } /* - * subblock - allocate new subcolors for one tree block of chrs, fill in arcs + * subcolorcvec - allocate new subcolors to cvec members, fill in arcs * - * Note: subcolors that are created during execution of this function - * will not be given a useful value of firstchr; it'll be left as CHR_MIN. - * For the current usage of firstchr in pg_regprefix, this does not matter - * because such subcolors won't occur in the common prefix of a regex. + * For each chr "c" represented by the cvec, do the equivalent of + * newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp); + * + * Note that in typical cases, many of the subcolors are the same. + * While newarc() would discard duplicate arc requests, we can save + * some cycles by not calling it repetitively to begin with. This is + * mechanized with the "lastsubcolor" state variable. */ static void -subblock(struct vars * v, - chr start, /* first of BYTTAB chrs */ - struct state * lp, - struct state * rp) +subcolorcvec(struct vars * v, + struct cvec * cv, + struct state * lp, + struct state * rp) { - uchr uc = start; struct colormap *cm = v->cm; - int shift; - int level; + color lastsubcolor = COLORLESS; + chr ch, + from, + to; + const chr *p; int i; - int b; - union tree *t; - union tree *cb; - union tree *fillt; - union tree *lastt; - int previ; - int ndone; - color co; - color sco; - assert((uc % BYTTAB) == 0); - - /* find its color block, making new pointer blocks as needed */ - t = cm->tree; - fillt = NULL; - for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0; - level++, shift -= BYTBITS) - { - b = (uc >> shift) & BYTMASK; - lastt = t; - t = lastt->tptr[b]; - assert(t != NULL); - fillt = &cm->tree[level + 1]; - if (t == fillt && shift > BYTBITS) - { /* need new ptr block */ - t = (union tree *) MALLOC(sizeof(struct ptrs)); - if (t == NULL) + /* ordinary characters */ + for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--) + { + ch = *p; + subcoloronechr(v, ch, lp, rp, &lastsubcolor); + NOERR(); + } + + /* and the ranges */ + for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--) + { + from = *p; + to = *(p + 1); + if (from <= MAX_SIMPLE_CHR) + { + /* deal with simple chars one at a time */ + chr lim = (to <= MAX_SIMPLE_CHR) ? to : MAX_SIMPLE_CHR; + + while (from <= lim) { - CERR(REG_ESPACE); - return; + color sco = subcolor(cm, from); + + NOERR(); + if (sco != lastsubcolor) + { + newarc(v->nfa, PLAIN, sco, lp, rp); + NOERR(); + lastsubcolor = sco; + } + from++; } - memcpy(VS(t->tptr), VS(fillt->tptr), - BYTTAB * sizeof(union tree *)); - lastt->tptr[b] = t; } + /* deal with any part of the range that's above MAX_SIMPLE_CHR */ + if (from < to) + subcoloronerange(v, from, to, lp, rp, &lastsubcolor); + else if (from == to) + subcoloronechr(v, from, lp, rp, &lastsubcolor); + NOERR(); } - /* special cases: fill block or solid block */ - co = t->tcolor[0]; - cb = cm->cd[co].block; - if (t == fillt || t == cb) + /* and deal with cclass if any */ + if (cv->cclasscode >= 0) { - /* either way, we want a subcolor solid block */ - sco = newsub(cm, co); - t = cm->cd[sco].block; - if (t == NULL) - { /* must set it up */ - t = (union tree *) MALLOC(sizeof(struct colors)); - if (t == NULL) + int classbit; + color *pco; + int r, + c; + + /* Enlarge array if we don't have a column bit assignment for cclass */ + if (cm->classbits[cv->cclasscode] == 0) + { + cm->classbits[cv->cclasscode] = cm->hiarraycols; + newhicolorcols(cm); + NOERR(); + } + /* Apply subcolorhi() and make arc for each entry in relevant cols */ + classbit = cm->classbits[cv->cclasscode]; + pco = cm->hicolormap; + for (r = 0; r < cm->hiarrayrows; r++) + { + for (c = 0; c < cm->hiarraycols; c++) { - CERR(REG_ESPACE); - return; + if (c & classbit) + { + color sco = subcolorhi(cm, pco); + + NOERR(); + /* add the arc if needed */ + if (sco != lastsubcolor) + { + newarc(v->nfa, PLAIN, sco, lp, rp); + NOERR(); + lastsubcolor = sco; + } + } + pco++; } - for (i = 0; i < BYTTAB; i++) - t->tcolor[i] = sco; - cm->cd[sco].block = t; } - /* find loop must have run at least once */ - lastt->tptr[b] = t; - newarc(v->nfa, PLAIN, sco, lp, rp); - cm->cd[co].nchrs -= BYTTAB; - cm->cd[sco].nchrs += BYTTAB; + } +} + +/* + * subcoloronechr - do subcolorcvec's work for a singleton chr + * + * We could just let subcoloronerange do this, but it's a bit more efficient + * if we exploit the single-chr case. Also, callers find it useful for this + * to be able to handle both low and high chr codes. + */ +static void +subcoloronechr(struct vars * v, + chr ch, + struct state * lp, + struct state * rp, + color *lastsubcolor) +{ + struct colormap *cm = v->cm; + colormaprange *newranges; + int numnewranges; + colormaprange *oldrange; + int oldrangen; + int newrow; + + /* Easy case for low chr codes */ + if (ch <= MAX_SIMPLE_CHR) + { + color sco = subcolor(cm, ch); + + NOERR(); + if (sco != *lastsubcolor) + { + newarc(v->nfa, PLAIN, sco, lp, rp); + *lastsubcolor = sco; + } + return; + } + + /* + * Potentially, we could need two more colormapranges than we have now, if + * the given chr is in the middle of some existing range. + */ + newranges = (colormaprange *) + MALLOC((cm->numcmranges + 2) * sizeof(colormaprange)); + if (newranges == NULL) + { + CERR(REG_ESPACE); return; } + numnewranges = 0; - /* general case, a mixed block to be altered */ - i = 0; - while (i < BYTTAB) + /* Ranges before target are unchanged */ + for (oldrange = cm->cmranges, oldrangen = 0; + oldrangen < cm->numcmranges; + oldrange++, oldrangen++) + { + if (oldrange->cmax >= ch) + break; + newranges[numnewranges++] = *oldrange; + } + + /* Match target chr against current range */ + if (oldrangen >= cm->numcmranges || oldrange->cmin > ch) + { + /* chr does not belong to any existing range, make a new one */ + newranges[numnewranges].cmin = ch; + newranges[numnewranges].cmax = ch; + /* row state should be cloned from the "all others" row */ + newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0); + numnewranges++; + } + else if (oldrange->cmin == oldrange->cmax) + { + /* we have an existing singleton range matching the chr */ + newranges[numnewranges++] = *oldrange; + newrow = oldrange->rownum; + /* we've now fully processed this old range */ + oldrange++, oldrangen++; + } + else { - co = t->tcolor[i]; - sco = newsub(cm, co); - newarc(v->nfa, PLAIN, sco, lp, rp); - previ = i; - do + /* chr is a subset of this existing range, must split it */ + if (ch > oldrange->cmin) + { + /* emit portion of old range before chr */ + newranges[numnewranges].cmin = oldrange->cmin; + newranges[numnewranges].cmax = ch - 1; + newranges[numnewranges].rownum = oldrange->rownum; + numnewranges++; + } + /* emit chr as singleton range, initially cloning from range */ + newranges[numnewranges].cmin = ch; + newranges[numnewranges].cmax = ch; + newranges[numnewranges].rownum = newrow = + newhicolorrow(cm, oldrange->rownum); + numnewranges++; + if (ch < oldrange->cmax) { - t->tcolor[i++] = sco; - } while (i < BYTTAB && t->tcolor[i] == co); - ndone = i - previ; - cm->cd[co].nchrs -= ndone; - cm->cd[sco].nchrs += ndone; + /* emit portion of old range after chr */ + newranges[numnewranges].cmin = ch + 1; + newranges[numnewranges].cmax = oldrange->cmax; + /* must clone the row if we are making two new ranges from old */ + newranges[numnewranges].rownum = + (ch > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) : + oldrange->rownum; + numnewranges++; + } + /* we've now fully processed this old range */ + oldrange++, oldrangen++; + } + + /* Update colors in newrow and create arcs as needed */ + subcoloronerow(v, newrow, lp, rp, lastsubcolor); + + /* Ranges after target are unchanged */ + for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++) + { + newranges[numnewranges++] = *oldrange; + } + + /* Assert our original space estimate was adequate */ + assert(numnewranges <= (cm->numcmranges + 2)); + + /* And finally, store back the updated list of ranges */ + if (cm->cmranges != NULL) + FREE(cm->cmranges); + cm->cmranges = newranges; + cm->numcmranges = numnewranges; +} + +/* + * subcoloronerange - do subcolorcvec's work for a high range + */ +static void +subcoloronerange(struct vars * v, + chr from, + chr to, + struct state * lp, + struct state * rp, + color *lastsubcolor) +{ + struct colormap *cm = v->cm; + colormaprange *newranges; + int numnewranges; + colormaprange *oldrange; + int oldrangen; + int newrow; + + /* Caller should take care of non-high-range cases */ + assert(from > MAX_SIMPLE_CHR); + assert(from < to); + + /* + * Potentially, if we have N non-adjacent ranges, we could need as many as + * 2N+1 result ranges (consider case where new range spans 'em all). + */ + newranges = (colormaprange *) + MALLOC((cm->numcmranges * 2 + 1) * sizeof(colormaprange)); + if (newranges == NULL) + { + CERR(REG_ESPACE); + return; + } + numnewranges = 0; + + /* Ranges before target are unchanged */ + for (oldrange = cm->cmranges, oldrangen = 0; + oldrangen < cm->numcmranges; + oldrange++, oldrangen++) + { + if (oldrange->cmax >= from) + break; + newranges[numnewranges++] = *oldrange; + } + + /* + * Deal with ranges that (partially) overlap the target. As we process + * each such range, increase "from" to remove the dealt-with characters + * from the target range. + */ + while (oldrangen < cm->numcmranges && oldrange->cmin <= to) + { + if (from < oldrange->cmin) + { + /* Handle portion of new range that corresponds to no old range */ + newranges[numnewranges].cmin = from; + newranges[numnewranges].cmax = oldrange->cmin - 1; + /* row state should be cloned from the "all others" row */ + newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0); + numnewranges++; + /* Update colors in newrow and create arcs as needed */ + subcoloronerow(v, newrow, lp, rp, lastsubcolor); + /* We've now fully processed the part of new range before old */ + from = oldrange->cmin; + } + + if (from <= oldrange->cmin && to >= oldrange->cmax) + { + /* old range is fully contained in new, process it in-place */ + newranges[numnewranges++] = *oldrange; + newrow = oldrange->rownum; + from = oldrange->cmax + 1; + } + else + { + /* some part of old range does not overlap new range */ + if (from > oldrange->cmin) + { + /* emit portion of old range before new range */ + newranges[numnewranges].cmin = oldrange->cmin; + newranges[numnewranges].cmax = from - 1; + newranges[numnewranges].rownum = oldrange->rownum; + numnewranges++; + } + /* emit common subrange, initially cloning from old range */ + newranges[numnewranges].cmin = from; + newranges[numnewranges].cmax = + (to < oldrange->cmax) ? to : oldrange->cmax; + newranges[numnewranges].rownum = newrow = + newhicolorrow(cm, oldrange->rownum); + numnewranges++; + if (to < oldrange->cmax) + { + /* emit portion of old range after new range */ + newranges[numnewranges].cmin = to + 1; + newranges[numnewranges].cmax = oldrange->cmax; + /* must clone the row if we are making two new ranges from old */ + newranges[numnewranges].rownum = + (from > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) : + oldrange->rownum; + numnewranges++; + } + from = oldrange->cmax + 1; + } + /* Update colors in newrow and create arcs as needed */ + subcoloronerow(v, newrow, lp, rp, lastsubcolor); + /* we've now fully processed this old range */ + oldrange++, oldrangen++; + } + + if (from <= to) + { + /* Handle portion of new range that corresponds to no old range */ + newranges[numnewranges].cmin = from; + newranges[numnewranges].cmax = to; + /* row state should be cloned from the "all others" row */ + newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0); + numnewranges++; + /* Update colors in newrow and create arcs as needed */ + subcoloronerow(v, newrow, lp, rp, lastsubcolor); + } + + /* Ranges after target are unchanged */ + for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++) + { + newranges[numnewranges++] = *oldrange; + } + + /* Assert our original space estimate was adequate */ + assert(numnewranges <= (cm->numcmranges * 2 + 1)); + + /* And finally, store back the updated list of ranges */ + if (cm->cmranges != NULL) + FREE(cm->cmranges); + cm->cmranges = newranges; + cm->numcmranges = numnewranges; +} + +/* + * subcoloronerow - do subcolorcvec's work for one new row in the high colormap + */ +static void +subcoloronerow(struct vars * v, + int rownum, + struct state * lp, + struct state * rp, + color *lastsubcolor) +{ + struct colormap *cm = v->cm; + color *pco; + int i; + + /* Apply subcolorhi() and make arc for each entry in row */ + pco = &cm->hicolormap[rownum * cm->hiarraycols]; + for (i = 0; i < cm->hiarraycols; pco++, i++) + { + color sco = subcolorhi(cm, pco); + + NOERR(); + /* make the arc if needed */ + if (sco != *lastsubcolor) + { + newarc(v->nfa, PLAIN, sco, lp, rp); + NOERR(); + *lastsubcolor = sco; + } } } @@ -575,12 +934,12 @@ okcolors(struct nfa * nfa, { /* is subcolor, let parent deal with it */ } - else if (cd->nchrs == 0) + else if (cd->nschrs == 0 && cd->nuchrs == 0) { /* parent empty, its arcs change color to subcolor */ cd->sub = NOSUB; scd = &cm->cd[sco]; - assert(scd->nchrs > 0); + assert(scd->nschrs > 0 || scd->nuchrs > 0); assert(scd->sub == sco); scd->sub = NOSUB; while ((a = cd->arcs) != NULL) @@ -597,7 +956,7 @@ okcolors(struct nfa * nfa, /* parent's arcs must gain parallel subcolor arcs */ cd->sub = NOSUB; scd = &cm->cd[sco]; - assert(scd->nchrs > 0); + assert(scd->nschrs > 0 || scd->nuchrs > 0); assert(scd->sub == sco); scd->sub = NOSUB; for (a = cd->arcs; a != NULL; a = a->colorchain) @@ -711,62 +1070,54 @@ dumpcolors(struct colormap * cm, struct colordesc *end; color co; chr c; - char *has; fprintf(f, "max %ld\n", (long) cm->max); - if (NBYTS > 1) - fillcheck(cm, cm->tree, 0, f); end = CDEND(cm); for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */ + { if (!UNUSEDCOLOR(cd)) { - assert(cd->nchrs > 0); - has = (cd->block != NULL) ? "#" : ""; + assert(cd->nschrs > 0 || cd->nuchrs > 0); if (cd->flags & PSEUDO) - fprintf(f, "#%2ld%s(ps): ", (long) co, has); + fprintf(f, "#%2ld(ps): ", (long) co); else - fprintf(f, "#%2ld%s(%2d): ", (long) co, - has, cd->nchrs); + fprintf(f, "#%2ld(%2d): ", (long) co, cd->nschrs + cd->nuchrs); /* * Unfortunately, it's hard to do this next bit more efficiently. - * - * Spencer's original coding has the loop iterating from CHR_MIN - * to CHR_MAX, but that's utterly unusable for 32-bit chr. For - * debugging purposes it seems fine to print only chr codes up to - * 1000 or so. */ - for (c = CHR_MIN; c < 1000; c++) + for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++) if (GETCOLOR(cm, c) == co) dumpchr(c, f); fprintf(f, "\n"); } -} - -/* - * fillcheck - check proper filling of a tree - */ -static void -fillcheck(struct colormap * cm, - union tree * tree, - int level, /* level number (top == 0) of this block */ - FILE *f) -{ - int i; - union tree *t; - union tree *fillt = &cm->tree[level + 1]; - - assert(level < NBYTS - 1); /* this level has pointers */ - for (i = BYTTAB - 1; i >= 0; i--) + } + /* dump the high colormap if it contains anything interesting */ + if (cm->hiarrayrows > 1 || cm->hiarraycols > 1) { - t = tree->tptr[i]; - if (t == NULL) - fprintf(f, "NULL found in filled tree!\n"); - else if (t == fillt) + int r, + c; + const color *rowptr; + + fprintf(f, "other:\t"); + for (c = 0; c < cm->hiarraycols; c++) + { + fprintf(f, "\t%ld", (long) cm->hicolormap[c]); + } + fprintf(f, "\n"); + for (r = 0; r < cm->numcmranges; r++) { + dumpchr(cm->cmranges[r].cmin, f); + fprintf(f, ".."); + dumpchr(cm->cmranges[r].cmax, f); + fprintf(f, ":"); + rowptr = &cm->hicolormap[cm->cmranges[r].rownum * cm->hiarraycols]; + for (c = 0; c < cm->hiarraycols; c++) + { + fprintf(f, "\t%ld", (long) rowptr[c]); + } + fprintf(f, "\n"); } - else if (level < NBYTS - 2) /* more pointer blocks below */ - fillcheck(cm, t, level + 1, f); } } diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c index 3a9e8cfbbd7..50b7a4574b4 100644 --- a/src/backend/regex/regc_cvec.c +++ b/src/backend/regex/regc_cvec.c @@ -34,7 +34,8 @@ /* * Notes: - * Only (selected) functions in _this_ file should treat chr* as non-constant. + * Only (selected) functions in _this_ file should treat the chr arrays + * of a cvec as non-constant. */ /* @@ -67,6 +68,7 @@ clearcvec(struct cvec * cv) assert(cv != NULL); cv->nchrs = 0; cv->nranges = 0; + cv->cclasscode = -1; return cv; } diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c index 399de027cdd..7cb3a40a0c8 100644 --- a/src/backend/regex/regc_locale.c +++ b/src/backend/regex/regc_locale.c @@ -349,6 +349,19 @@ static const struct cname } }; +/* + * The following arrays define the valid character class names. + */ +static const char *const classNames[NUM_CCLASSES + 1] = { + "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph", + "lower", "print", "punct", "space", "upper", "xdigit", NULL +}; + +enum classes +{ + CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH, + CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT +}; /* * We do not use the hard-wired Unicode classification tables that Tcl does. @@ -544,21 +557,6 @@ cclass(struct vars * v, /* context */ index; /* - * The following arrays define the valid character class names. - */ - - static const char *const classNames[] = { - "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph", - "lower", "print", "punct", "space", "upper", "xdigit", NULL - }; - - enum classes - { - CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH, - CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT - }; - - /* * Map the name to the corresponding enumerated value. */ len = endp - startp; @@ -593,18 +591,20 @@ cclass(struct vars * v, /* context */ * pg_ctype_get_cache so that we can cache the results. Other classes * have definitions that are hard-wired here, and for those we just * construct a transient cvec on the fly. + * + * NB: keep this code in sync with cclass_column_index(), below. */ switch ((enum classes) index) { case CC_PRINT: - cv = pg_ctype_get_cache(pg_wc_isprint); + cv = pg_ctype_get_cache(pg_wc_isprint, index); break; case CC_ALNUM: - cv = pg_ctype_get_cache(pg_wc_isalnum); + cv = pg_ctype_get_cache(pg_wc_isalnum, index); break; case CC_ALPHA: - cv = pg_ctype_get_cache(pg_wc_isalpha); + cv = pg_ctype_get_cache(pg_wc_isalpha, index); break; case CC_ASCII: /* hard-wired meaning */ @@ -625,10 +625,10 @@ cclass(struct vars * v, /* context */ addrange(cv, 0x7f, 0x9f); break; case CC_DIGIT: - cv = pg_ctype_get_cache(pg_wc_isdigit); + cv = pg_ctype_get_cache(pg_wc_isdigit, index); break; case CC_PUNCT: - cv = pg_ctype_get_cache(pg_wc_ispunct); + cv = pg_ctype_get_cache(pg_wc_ispunct, index); break; case CC_XDIGIT: @@ -646,16 +646,16 @@ cclass(struct vars * v, /* context */ } break; case CC_SPACE: - cv = pg_ctype_get_cache(pg_wc_isspace); + cv = pg_ctype_get_cache(pg_wc_isspace, index); break; case CC_LOWER: - cv = pg_ctype_get_cache(pg_wc_islower); + cv = pg_ctype_get_cache(pg_wc_islower, index); break; case CC_UPPER: - cv = pg_ctype_get_cache(pg_wc_isupper); + cv = pg_ctype_get_cache(pg_wc_isupper, index); break; case CC_GRAPH: - cv = pg_ctype_get_cache(pg_wc_isgraph); + cv = pg_ctype_get_cache(pg_wc_isgraph, index); break; } @@ -666,6 +666,47 @@ cclass(struct vars * v, /* context */ } /* + * cclass_column_index - get appropriate high colormap column index for chr + */ +static int +cclass_column_index(struct colormap * cm, chr c) +{ + int colnum = 0; + + /* Shouldn't go through all these pushups for simple chrs */ + assert(c > MAX_SIMPLE_CHR); + + /* + * Note: we should not see requests to consider cclasses that are not + * treated as locale-specific by cclass(), above. + */ + if (cm->classbits[CC_PRINT] && pg_wc_isprint(c)) + colnum |= cm->classbits[CC_PRINT]; + if (cm->classbits[CC_ALNUM] && pg_wc_isalnum(c)) + colnum |= cm->classbits[CC_ALNUM]; + if (cm->classbits[CC_ALPHA] && pg_wc_isalpha(c)) + colnum |= cm->classbits[CC_ALPHA]; + assert(cm->classbits[CC_ASCII] == 0); + assert(cm->classbits[CC_BLANK] == 0); + assert(cm->classbits[CC_CNTRL] == 0); + if (cm->classbits[CC_DIGIT] && pg_wc_isdigit(c)) + colnum |= cm->classbits[CC_DIGIT]; + if (cm->classbits[CC_PUNCT] && pg_wc_ispunct(c)) + colnum |= cm->classbits[CC_PUNCT]; + assert(cm->classbits[CC_XDIGIT] == 0); + if (cm->classbits[CC_SPACE] && pg_wc_isspace(c)) + colnum |= cm->classbits[CC_SPACE]; + if (cm->classbits[CC_LOWER] && pg_wc_islower(c)) + colnum |= cm->classbits[CC_LOWER]; + if (cm->classbits[CC_UPPER] && pg_wc_isupper(c)) + colnum |= cm->classbits[CC_UPPER]; + if (cm->classbits[CC_GRAPH] && pg_wc_isgraph(c)) + colnum |= cm->classbits[CC_GRAPH]; + + return colnum; +} + +/* * allcases - supply cvec for all case counterparts of a chr (including itself) * * This is a shortcut, preferably an efficient one, for simple characters; diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c index 551ae7dc08e..ad9d4b1961d 100644 --- a/src/backend/regex/regc_pg_locale.c +++ b/src/backend/regex/regc_pg_locale.c @@ -736,7 +736,7 @@ store_match(pg_ctype_cache *pcc, pg_wchar chr1, int nchrs) * Note that the result must not be freed or modified by caller. */ static struct cvec * -pg_ctype_get_cache(pg_wc_probefunc probefunc) +pg_ctype_get_cache(pg_wc_probefunc probefunc, int cclasscode) { pg_ctype_cache *pcc; pg_wchar max_chr; @@ -770,31 +770,43 @@ pg_ctype_get_cache(pg_wc_probefunc probefunc) pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2); if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL) goto out_of_memory; + pcc->cv.cclasscode = cclasscode; /* - * Decide how many character codes we ought to look through. For C locale - * there's no need to go further than 127. Otherwise, if the encoding is - * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot - * extend it as far as we'd like (say, 0xFFFF, the end of the Basic - * Multilingual Plane) without creating significant performance issues due - * to too many characters being fed through the colormap code. This will - * need redesign to fix reasonably, but at least for the moment we have - * all common European languages covered. Otherwise (not C, not UTF8) go - * up to 255. These limits are interrelated with restrictions discussed - * at the head of this file. + * Decide how many character codes we ought to look through. In general + * we don't go past MAX_SIMPLE_CHR; chr codes above that are handled at + * runtime using the "high colormap" mechanism. However, in C locale + * there's no need to go further than 127, and if we only have a 1-byte + * <ctype.h> API there's no need to go further than that can handle. + * + * If it's not MAX_SIMPLE_CHR that's constraining the search, mark the + * output cvec as not having any locale-dependent behavior, since there + * will be no need to do any run-time locale checks. (The #if's here + * would always be true for production values of MAX_SIMPLE_CHR, but it's + * useful to allow it to be small for testing purposes.) */ switch (pg_regex_strategy) { case PG_REGEX_LOCALE_C: +#if MAX_SIMPLE_CHR >= 127 max_chr = (pg_wchar) 127; + pcc->cv.cclasscode = -1; +#else + max_chr = (pg_wchar) MAX_SIMPLE_CHR; +#endif break; case PG_REGEX_LOCALE_WIDE: case PG_REGEX_LOCALE_WIDE_L: - max_chr = (pg_wchar) 0x7FF; + max_chr = (pg_wchar) MAX_SIMPLE_CHR; break; case PG_REGEX_LOCALE_1BYTE: case PG_REGEX_LOCALE_1BYTE_L: +#if MAX_SIMPLE_CHR >= UCHAR_MAX max_chr = (pg_wchar) UCHAR_MAX; + pcc->cv.cclasscode = -1; +#else + max_chr = (pg_wchar) MAX_SIMPLE_CHR; +#endif break; default: max_chr = 0; /* can't get here, but keep compiler quiet */ diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index b211cc0a189..ed95474b3f0 100644 --- a/src/backend/regex/regcomp.c +++ b/src/backend/regex/regcomp.c @@ -55,7 +55,6 @@ static void cbracket(struct vars *, struct state *, struct state *); static void brackpart(struct vars *, struct state *, struct state *); static const chr *scanplain(struct vars *); static void onechr(struct vars *, chr, struct state *, struct state *); -static void dovec(struct vars *, struct cvec *, struct state *, struct state *); static void wordchrs(struct vars *); static void processlacon(struct vars *, struct state *, struct state *, int, struct state *, struct state *); @@ -96,16 +95,19 @@ static chr chrnamed(struct vars *, const chr *, const chr *, chr); /* === regc_color.c === */ static void initcm(struct vars *, struct colormap *); static void freecm(struct colormap *); -static void cmtreefree(struct colormap *, union tree *, int); -static color setcolor(struct colormap *, chr, color); static color maxcolor(struct colormap *); static color newcolor(struct colormap *); static void freecolor(struct colormap *, color); static color pseudocolor(struct colormap *); -static color subcolor(struct colormap *, chr c); +static color subcolor(struct colormap *, chr); +static color subcolorhi(struct colormap *, color *); static color newsub(struct colormap *, color); -static void subrange(struct vars *, chr, chr, struct state *, struct state *); -static void subblock(struct vars *, chr, struct state *, struct state *); +static int newhicolorrow(struct colormap *, int); +static void newhicolorcols(struct colormap *); +static void subcolorcvec(struct vars *, struct cvec *, struct state *, struct state *); +static void subcoloronechr(struct vars *, chr, struct state *, struct state *, color *); +static void subcoloronerange(struct vars *, chr, chr, struct state *, struct state *, color *); +static void subcoloronerow(struct vars *, int, struct state *, struct state *, color *); static void okcolors(struct nfa *, struct colormap *); static void colorchain(struct colormap *, struct arc *); static void uncolorchain(struct colormap *, struct arc *); @@ -114,7 +116,6 @@ static void colorcomplement(struct nfa *, struct colormap *, int, struct state * #ifdef REG_DEBUG static void dumpcolors(struct colormap *, FILE *); -static void fillcheck(struct colormap *, union tree *, int, FILE *); static void dumpchr(chr, FILE *); #endif /* === regc_nfa.c === */ @@ -215,6 +216,7 @@ static struct cvec *range(struct vars *, chr, chr, int); static int before(chr, chr); static struct cvec *eclass(struct vars *, chr, int); static struct cvec *cclass(struct vars *, const chr *, const chr *, int); +static int cclass_column_index(struct colormap *, chr); static struct cvec *allcases(struct vars *, chr); static int cmp(const chr *, const chr *, size_t); static int casecmp(const chr *, const chr *, size_t); @@ -1467,7 +1469,7 @@ brackpart(struct vars * v, NOERR(); cv = eclass(v, startc, (v->cflags & REG_ICASE)); NOERR(); - dovec(v, cv, lp, rp); + subcolorcvec(v, cv, lp, rp); return; break; case CCLASS: @@ -1477,7 +1479,7 @@ brackpart(struct vars * v, NOERR(); cv = cclass(v, startp, endp, (v->cflags & REG_ICASE)); NOERR(); - dovec(v, cv, lp, rp); + subcolorcvec(v, cv, lp, rp); return; break; default: @@ -1523,7 +1525,7 @@ brackpart(struct vars * v, NOTE(REG_UUNPORT); cv = range(v, startc, endc, (v->cflags & REG_ICASE)); NOERR(); - dovec(v, cv, lp, rp); + subcolorcvec(v, cv, lp, rp); } /* @@ -1565,46 +1567,14 @@ onechr(struct vars * v, { if (!(v->cflags & REG_ICASE)) { - newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp); + color lastsubcolor = COLORLESS; + + subcoloronechr(v, c, lp, rp, &lastsubcolor); return; } /* rats, need general case anyway... */ - dovec(v, allcases(v, c), lp, rp); -} - -/* - * dovec - fill in arcs for each element of a cvec - */ -static void -dovec(struct vars * v, - struct cvec * cv, - struct state * lp, - struct state * rp) -{ - chr ch, - from, - to; - const chr *p; - int i; - - /* ordinary characters */ - for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--) - { - ch = *p; - newarc(v->nfa, PLAIN, subcolor(v->cm, ch), lp, rp); - NOERR(); - } - - /* and the ranges */ - for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--) - { - from = *p; - to = *(p + 1); - if (from <= to) - subrange(v, from, to, lp, rp); - NOERR(); - } + subcolorcvec(v, allcases(v, c), lp, rp); } /* diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c index 93da82286f3..a515949bb4d 100644 --- a/src/backend/regex/regexport.c +++ b/src/backend/regex/regexport.c @@ -28,10 +28,6 @@ #include "regex/regexport.h" -static void scancolormap(struct colormap * cm, int co, - union tree * t, int level, chr partial, - pg_wchar **chars, int *chars_len); - /* * Get total number of NFA states. @@ -187,10 +183,7 @@ pg_reg_colorisend(const regex_t *regex, int co) * * Note: we return -1 if the color number is invalid, or if it is a special * color (WHITE or a pseudocolor), or if the number of members is uncertain. - * The latter case cannot arise right now but is specified to allow for future - * improvements (see musings about run-time handling of higher character codes - * in regex/README). Callers should not try to extract the members if -1 is - * returned. + * Callers should not try to extract the members if -1 is returned. */ int pg_reg_getnumcharacters(const regex_t *regex, int co) @@ -205,7 +198,18 @@ pg_reg_getnumcharacters(const regex_t *regex, int co) if (cm->cd[co].flags & PSEUDO) /* also pseudocolors (BOS etc) */ return -1; - return cm->cd[co].nchrs; + /* + * If the color appears anywhere in the high colormap, treat its number of + * members as uncertain. In principle we could determine all the specific + * chrs corresponding to each such entry, but it would be expensive + * (particularly if character class tests are required) and it doesn't + * seem worth it. + */ + if (cm->cd[co].nuchrs != 0) + return -1; + + /* OK, return the known number of member chrs */ + return cm->cd[co].nschrs; } /* @@ -222,6 +226,7 @@ pg_reg_getcharacters(const regex_t *regex, int co, pg_wchar *chars, int chars_len) { struct colormap *cm; + chr c; assert(regex != NULL && regex->re_magic == REMAGIC); cm = &((struct guts *) regex->re_guts)->cmap; @@ -231,62 +236,17 @@ pg_reg_getcharacters(const regex_t *regex, int co, if (cm->cd[co].flags & PSEUDO) return; - /* Recursively search the colormap tree */ - scancolormap(cm, co, cm->tree, 0, 0, &chars, &chars_len); -} - -/* - * Recursively scan the colormap tree to find chrs belonging to color "co". - * See regex/README for info about the tree structure. - * - * t: tree block to scan - * level: level (from 0) of t - * partial: partial chr code for chrs within t - * chars, chars_len: output area - */ -static void -scancolormap(struct colormap * cm, int co, - union tree * t, int level, chr partial, - pg_wchar **chars, int *chars_len) -{ - int i; - - if (level < NBYTS - 1) - { - /* non-leaf node */ - for (i = 0; i < BYTTAB; i++) - { - /* - * We do not support search for chrs of color 0 (WHITE), so - * all-white subtrees need not be searched. These can be - * recognized because they are represented by the fill blocks in - * the colormap struct. This typically allows us to avoid - * scanning large regions of higher-numbered chrs. - */ - if (t->tptr[i] == &cm->tree[level + 1]) - continue; - - /* Recursively scan next level down */ - scancolormap(cm, co, - t->tptr[i], level + 1, - (partial | (chr) i) << BYTBITS, - chars, chars_len); - } - } - else + /* + * We need only examine the low character map; there should not be any + * matching entries in the high map. + */ + for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++) { - /* leaf node */ - for (i = 0; i < BYTTAB; i++) + if (cm->locolormap[c - CHR_MIN] == co) { - if (t->tcolor[i] == co) - { - if (*chars_len > 0) - { - **chars = partial | (chr) i; - (*chars)++; - (*chars_len)--; - } - } + *chars++ = c; + if (--chars_len == 0) + break; } } } diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c index 04b6449a20c..f3387152250 100644 --- a/src/backend/regex/regprefix.c +++ b/src/backend/regex/regprefix.c @@ -194,7 +194,10 @@ findprefix(struct cnfa * cnfa, if (thiscolor == COLORLESS) break; /* The color must be a singleton */ - if (cm->cd[thiscolor].nchrs != 1) + if (cm->cd[thiscolor].nschrs != 1) + break; + /* Must not have any high-color-map entries */ + if (cm->cd[thiscolor].nuchrs != 0) break; /* |