Diffstat (limited to 'doc/src')
-rw-r--r-- doc/src/sgml/datatype.sgml 28
-rw-r--r-- doc/src/sgml/gin.sgml 113
-rw-r--r-- doc/src/sgml/textsearch.sgml 19
-rw-r--r-- doc/src/sgml/xindex.sgml 9
4 files changed, 140 insertions, 29 deletions
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index fb813d70423..48dfe0a9c47 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.226 2008/03/30 04:08:14 neilc Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.227 2008/05/16 16:31:01 tgl Exp $ -->
<chapter id="datatype">
<title id="datatype-title">Data Types</title>
@@ -3298,18 +3298,17 @@ SELECT * FROM test;
SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
tsvector
----------------------------------------------------
- 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
+ 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
</programlisting>
- (As the example shows, the sorting is first by length and then
- alphabetically, but that detail is seldom important.) To represent
+ To represent
lexemes containing whitespace or punctuation, surround them with quotes:
<programlisting>
SELECT $$the lexeme ' ' contains spaces$$::tsvector;
tsvector
-------------------------------------------
- 'the' ' ' 'lexeme' 'spaces' 'contains'
+ ' ' 'contains' 'lexeme' 'spaces' 'the'
</programlisting>
(We use dollar-quoted string literals in this example and the next one,
@@ -3320,7 +3319,7 @@ SELECT $$the lexeme ' ' contains spaces$$::tsvector;
SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
tsvector
------------------------------------------------
- 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
+ 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
</programlisting>
Optionally, integer <firstterm>position(s)</>
@@ -3330,7 +3329,7 @@ SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
tsvector
-------------------------------------------------------------------------------
- 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
+ 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
</programlisting>
A position normally indicates the source word's location in the
@@ -3369,7 +3368,7 @@ SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
select 'The Fat Rats'::tsvector;
tsvector
--------------------
- 'Fat' 'The' 'Rats'
+ 'Fat' 'Rats' 'The'
</programlisting>
For most English-text-searching applications the above words would
@@ -3440,6 +3439,19 @@ SELECT 'fat:ab &amp; cat'::tsquery;
</para>
<para>
+ Also, lexemes in a <type>tsquery</type> can be labeled with <literal>*</>
+ to specify prefix matching:
+<programlisting>
+SELECT 'super:*'::tsquery;
+ tsquery
+-----------
+ 'super':*
+</programlisting>
+ This query will match any word in a <type>tsvector</> that begins
+ with <quote>super</>.
+ </para>
+
+ <para>
Quoting rules for lexemes are the same as described above for
lexemes in <type>tsvector</>; and, as with <type>tsvector</>,
any required normalization of words must be done before putting
diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml
index ad82da6b38e..961451f714a 100644
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.14 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.15 2008/05/16 16:31:01 tgl Exp $ -->
<chapter id="GIN">
<title>GIN Indexes</title>
@@ -52,15 +52,15 @@
</para>
<para>
- All it takes to get a <acronym>GIN</acronym> access method working
- is to implement four user-defined methods, which define the behavior of
+ All it takes to get a <acronym>GIN</acronym> access method working is to
+ implement four (or five) user-defined methods, which define the behavior of
keys in the tree and the relationships between keys, indexed values,
and indexable queries. In short, <acronym>GIN</acronym> combines
extensibility with generality, code reuse, and a clean interface.
</para>
<para>
- The four methods that an index operator class for
+ The four methods that an operator class for
<acronym>GIN</acronym> must provide are:
</para>
@@ -77,7 +77,7 @@
</varlistentry>
<varlistentry>
- <term>Datum* extractValue(Datum inputValue, int32 *nkeys)</term>
+ <term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
<listitem>
<para>
Returns an array of keys given a value to be indexed. The
@@ -87,8 +87,8 @@
</varlistentry>
<varlistentry>
- <term>Datum* extractQuery(Datum query, int32 *nkeys,
- StrategyNumber n)</term>
+ <term>Datum *extractQuery(Datum query, int32 *nkeys,
+ StrategyNumber n, bool **pmatch)</term>
<listitem>
<para>
Returns an array of keys given a value to be queried; that is,
@@ -100,13 +100,22 @@
to consult <literal>n</> to determine the data type of
<literal>query</> and the key values that need to be extracted.
The number of returned keys must be stored into <literal>*nkeys</>.
- If number of keys is equal to zero then <function>extractQuery</>
- should store 0 or -1 into <literal>*nkeys</>. 0 means that any
- row matches the <literal>query</> and sequence scan should be
- produced. -1 means nothing can satisfy <literal>query</>.
- Choice of value should be based on semantics meaning of operation with
- given strategy number.
+ If the query contains no keys then <function>extractQuery</>
+ should store 0 or -1 into <literal>*nkeys</>, depending on the
+ semantics of the operator. 0 means that every
+ value matches the <literal>query</> and a sequential scan should be
+ produced. -1 means nothing can match the <literal>query</>.
+ <literal>pmatch</> is an output argument for use when partial match
+ is supported. To use it, <function>extractQuery</> must allocate
+ an array of <literal>*nkeys</> booleans and store its address at
+ <literal>*pmatch</>. Each element of the array should be set to TRUE
+ if the corresponding key requires partial match, FALSE if not.
+ If <literal>*pmatch</> is set to NULL then GIN assumes partial match
+ is not required. The variable is initialized to NULL before the call,
+ so this argument can simply be ignored by operator classes that do
+ not support partial match.
</para>
+
</listitem>
</varlistentry>
@@ -133,6 +142,39 @@
</variablelist>
+ <para>
+ Optionally, an operator class for
+ <acronym>GIN</acronym> can supply a fifth method:
+ </para>
+
+ <variablelist>
+
+ <varlistentry>
+ <term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
+ <listitem>
+ <para>
+ Compare a partial-match query to an index key. Returns an integer
+ whose sign indicates the result: less than zero means the index key
+ does not match the query, but the index scan should continue; zero
+ means that the index key does match the query; greater than zero
+ indicates that the index scan should stop because no more matches
+ are possible. The strategy number <literal>n</> of the operator
+ that generated the partial match query is provided, in case its
+ semantics are needed to determine when to end the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <para>
+ To support <quote>partial match</> queries, an operator class must
+ provide the <function>comparePartial</> method, and its
+ <function>extractQuery</> method must set the <literal>pmatch</>
+ parameter when a partial-match query is encountered. See
+ <xref linkend="gin-partial-match"> for details.
+ </para>
+
</sect1>
<sect1 id="gin-implementation">
@@ -146,6 +188,33 @@
list of heap pointers (PL, posting list) if the list is small enough.
</para>
+ <sect2 id="gin-partial-match">
+ <title>Partial match algorithm</title>
+
+ <para>
+ GIN can support <quote>partial match</> queries, in which the query
+ does not determine an exact match for one or more keys, but the possible
+ matches fall within a reasonably narrow range of key values (within the
+ key sorting order determined by the <function>compare</> support method).
+ The <function>extractQuery</> method, instead of returning a key value
+ to be matched exactly, returns a key value that is the lower bound of
+ the range to be searched, and sets the <literal>pmatch</> flag true.
+ The key range is then searched using the <function>comparePartial</>
+ method. <function>comparePartial</> must return zero for an actual
+ match, less than zero for a non-match that is still within the range
+ to be searched, or greater than zero if the index key is past the range
+ that could match.
+ </para>
+
+ <para>
+ During a partial-match scan, all <literal>itemPointer</>s for matching keys
+ are OR'ed into a <literal>TIDBitmap</>.
+ The scan fails if the <literal>TIDBitmap</> becomes lossy.
+ In this case an error message will be reported with advice
+ to increase <literal>work_mem</>.
+ </para>
+ </sect2>
+
</sect1>
<sect1 id="gin-tips">
@@ -236,8 +305,14 @@
</para>
<para>
- <acronym>GIN</acronym> searches keys only by equality matching. This might
- be improved in future.
+ It is possible for an operator class to circumvent the restriction against
+ full index scans. To do that, <function>extractValue</> must return at least
+ one (possibly dummy) key for every indexed value, and
+ <function>extractQuery</function> must convert an unrestricted search into
+ a partial-match query that will scan the whole index. This is inefficient
+ but might be necessary to avoid corner-case failures with operators such
+ as LIKE. Note however that failure could still occur if the intermediate
+ <literal>TIDBitmap</> becomes lossy.
</para>
</sect1>
@@ -247,9 +322,11 @@
<para>
The <productname>PostgreSQL</productname> source distribution includes
<acronym>GIN</acronym> operator classes for <type>tsvector</> and
- for one-dimensional arrays of all internal types. The following
- <filename>contrib</> modules also contain <acronym>GIN</acronym>
- operator classes:
+ for one-dimensional arrays of all internal types. Prefix searching in
+ <type>tsvector</> is implemented using the <acronym>GIN</> partial match
+ feature.
+ The following <filename>contrib</> modules also contain
+ <acronym>GIN</acronym> operator classes:
</para>
<variablelist>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index caa8847ef8e..41db566b6cc 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.43 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.44 2008/05/16 16:31:01 tgl Exp $ -->
<chapter id="textsearch">
<title id="textsearch-title">Full Text Search</title>
@@ -754,6 +754,20 @@ SELECT to_tsquery('english', 'Fat | Rats:AB');
'fat' | 'rat':AB
</programlisting>
+ Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
+
+<programlisting>
+SELECT to_tsquery('supern:*A &amp; star:A*B');
+ to_tsquery
+--------------------------
+ 'supern':*A &amp; 'star':*AB
+</programlisting>
+
+ Such a lexeme will match any word in a <type>tsvector</> that begins
+ with the given string.
+ </para>
+
+ <para>
<function>to_tsquery</function> can also accept single-quoted
phrases. This is primarily useful when the configuration includes a
thesaurus dictionary that may trigger on such phrases.
@@ -798,7 +812,8 @@ SELECT to_tsquery('''supernovae stars'' &amp; !crab');
</programlisting>
Note that <function>plainto_tsquery</> cannot
- recognize either Boolean operators or weight labels in its input:
+ recognize Boolean operators, weight labels, or prefix-match labels
+ in its input:
<programlisting>
SELECT plainto_tsquery('english', 'The Fat &amp; Rats:C');
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 6bf7535b636..84b2c9050a1 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.62 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.63 2008/05/16 16:31:01 tgl Exp $ -->
<sect1 id="xindex">
<title>Interfacing Extensions To Indexes</title>
@@ -444,6 +444,13 @@
<entry>consistent - determine whether value matches query condition</entry>
<entry>4</entry>
</row>
+ <row>
+ <entry>comparePartial - (optional method) compare partial key from
+ query and key from index, and return an integer less than zero, zero,
+ or greater than zero, indicating whether GIN should ignore this index
+ entry, treat the entry as a match, or stop the index scan</entry>
+ <entry>5</entry>
+ </row>
</tbody>
</tgroup>
</table>
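
Note on the comparePartial contract documented above: the following is a minimal standalone sketch of how the three-way return value drives a prefix scan, not the code from this commit. It assumes keys are plain C strings sorted by strcmp() in place of the compare support method. The names prefix_compare_partial and scan_prefix are hypothetical and do not exist in PostgreSQL, and real GIN support functions use the Datum-based fmgr calling convention rather than these signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a comparePartial method.  The query is a
 * prefix and index keys are C strings ordered by strcmp().
 * Returns 0 for a match, < 0 for a non-match where the scan should
 * continue, and > 0 when no later key can match.
 */
static int
prefix_compare_partial(const char *prefix, const char *key)
{
    int cmp;

    assert(prefix != NULL && key != NULL);

    cmp = strncmp(key, prefix, strlen(prefix));
    if (cmp == 0)
        return 0;       /* key begins with the prefix: match */
    if (cmp < 0)
        return -1;      /* key sorts before the range: keep scanning */
    return 1;           /* key sorts past the range: stop the scan */
}

/*
 * Hypothetical model of a partial-match scan over a sorted key array.
 * The prefix itself serves as the lower bound of the range, as the
 * documentation describes for the key returned by extractQuery.
 * Returns the number of matching keys.
 */
static size_t
scan_prefix(const char *const *keys, size_t nkeys, const char *prefix)
{
    size_t i = 0;
    size_t matches = 0;

    /* Locate the lower bound (linear search, for brevity only). */
    while (i < nkeys && strcmp(keys[i], prefix) < 0)
        i++;

    for (; i < nkeys; i++)
    {
        int r = prefix_compare_partial(prefix, keys[i]);

        if (r > 0)
            break;      /* past the range that could match */
        if (r == 0)
            matches++;
    }
    return matches;
}
```

In this model the scan starts at the first key not less than the prefix, so the negative return is never produced during the loop itself; it is kept because the documented contract allows a method to skip non-matching keys that still lie inside the range.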