diff options
author | Tom Lane <tgl@sss.pgh.pa.us> | 2013-04-09 01:05:55 -0400 |
---|---|---|
committer | Tom Lane <tgl@sss.pgh.pa.us> | 2013-04-09 01:06:54 -0400 |
commit | 3ccae48f44d993351e1f881761bd6c556ebd6638 (patch) | |
tree | ec365a856f56aae32745644164d0a8b5889ee5fd /doc/src | |
parent | e60d20a35e436cef3c454bfeab34d8ea71b54910 (diff) | |
download | postgresql-3ccae48f44d993351e1f881761bd6c556ebd6638.tar.gz postgresql-3ccae48f44d993351e1f881761bd6c556ebd6638.zip |
Support indexing of regular-expression searches in contrib/pg_trgm.
This works by extracting trigrams from the given regular expression,
in generally the same spirit as the previously-existing support for
LIKE searches, though of course the details are far more complicated.
Currently, only GIN indexes are supported. We might be able to make
it work with GiST indexes later.
The implementation includes adding API functions to backend/regex/
to provide a view of the search NFA created from a regular expression.
These functions are meant to be generic enough to be supportable in
a standalone version of the regex library, should that ever happen.
Alexander Korotkov, reviewed by Heikki Linnakangas and Tom Lane
Diffstat (limited to 'doc/src')
-rw-r--r-- | doc/src/sgml/pgtrgm.sgml | 43 |
1 files changed, 38 insertions, 5 deletions
diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml index 30e535557c3..4572750f4d7 100644 --- a/doc/src/sgml/pgtrgm.sgml +++ b/doc/src/sgml/pgtrgm.sgml @@ -9,7 +9,7 @@ <para> The <filename>pg_trgm</filename> module provides functions and operators - for determining the similarity of <acronym>ASCII</> + for determining the similarity of alphanumeric text based on trigram matching, as well as index operator classes that support fast searching for similar strings. @@ -28,7 +28,9 @@ <note> <para> - A string is considered to have two spaces + <filename>pg_trgm</filename> ignores non-word characters + (non-alphanumerics) when extracting trigrams from a string. + Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string. For example, the set of trigrams in the string @@ -37,6 +39,16 @@ <quote><literal> ca</literal></quote>, <quote><literal>cat</literal></quote>, and <quote><literal>at </literal></quote>. + The set of trigrams in the string + <quote><literal>foo|bar</literal></quote> is + <quote><literal> f</literal></quote>, + <quote><literal> fo</literal></quote>, + <quote><literal>foo</literal></quote>, + <quote><literal>oo </literal></quote>, + <quote><literal> b</literal></quote>, + <quote><literal> ba</literal></quote>, + <quote><literal>bar</literal></quote>, and + <quote><literal>ar </literal></quote>. </para> </note> </sect2> @@ -145,9 +157,10 @@ operator classes that allow you to create an index over a text column for the purpose of very fast similarity searches. These index types support the above-described similarity operators, and additionally support - trigram-based index searches for <literal>LIKE</> and <literal>ILIKE</> - queries. (These indexes do not support equality nor simple comparison - operators, so you may need a regular B-tree index too.) + trigram-based index searches for <literal>LIKE</>, <literal>ILIKE</>, + <literal>~</> and <literal>~*</> queries. (These indexes do not + support equality nor simple comparison operators, so you may need a + regular B-tree index too.) </para> <para> @@ -203,6 +216,26 @@ SELECT * FROM test_trgm WHERE t LIKE '%foo%bar'; </para> <para> + Beginning in <productname>PostgreSQL</> 9.3, <filename>pg_trgm</filename> + GIN indexes also support index searches for regular-expression matches + (<literal>~</> and <literal>~*</> operators), for example +<programlisting> +SELECT * FROM test_trgm WHERE t ~ '(foo|bar)'; +</programlisting> + The index search works by extracting trigrams from the regular expression + and then looking these up in the index. The more trigrams that can be + extracted from the regular expression, the more effective the index search + is. Unlike B-tree based searches, the search string need not be + left-anchored. + </para> + + <para> + For both <literal>LIKE</> and regular-expression searches, keep in mind + that a pattern with no extractable trigrams will degenerate to a full-index + scan. + </para> + + <para> The choice between GiST and GIN indexing depends on the relative performance characteristics of GiST and GIN, which are discussed elsewhere. As a rule of thumb, a GIN index is faster to search than a GiST index, but |