Diffstat (limited to 'doc/src')
-rw-r--r-- | doc/src/sgml/datatype.sgml | 28
-rw-r--r-- | doc/src/sgml/gin.sgml | 113
-rw-r--r-- | doc/src/sgml/textsearch.sgml | 19
-rw-r--r-- | doc/src/sgml/xindex.sgml | 9
4 files changed, 140 insertions, 29 deletions
diff --git a/doc/src/sgml/datatype.sgml b/doc/src/sgml/datatype.sgml
index fb813d70423..48dfe0a9c47 100644
--- a/doc/src/sgml/datatype.sgml
+++ b/doc/src/sgml/datatype.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.226 2008/03/30 04:08:14 neilc Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/datatype.sgml,v 1.227 2008/05/16 16:31:01 tgl Exp $ -->
 <chapter id="datatype">
 <title id="datatype-title">Data Types</title>
@@ -3298,18 +3298,17 @@ SELECT * FROM test;
 SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
 tsvector
 ----------------------------------------------------
- 'a' 'on' 'and' 'ate' 'cat' 'fat' 'mat' 'rat' 'sat'
+ 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'
 </programlisting>
- (As the example shows, the sorting is first by length and then
- alphabetically, but that detail is seldom important.) To represent
+ To represent
 lexemes containing whitespace or punctuation, surround them with quotes:
 <programlisting>
 SELECT $$the lexeme ' ' contains spaces$$::tsvector;
 tsvector
 -------------------------------------------
- 'the' ' ' 'lexeme' 'spaces' 'contains'
+ ' ' 'contains' 'lexeme' 'spaces' 'the'
 </programlisting>
 (We use dollar-quoted string literals in this example and the next one,
@@ -3320,7 +3319,7 @@ SELECT $$the lexeme ' ' contains spaces$$::tsvector;
 SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
 tsvector
 ------------------------------------------------
- 'a' 'the' 'Joe''s' 'quote' 'lexeme' 'contains'
+ 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'
 </programlisting>
 Optionally, integer <firstterm>position(s)</>
@@ -3330,7 +3329,7 @@ SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
 SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
 tsvector
 -------------------------------------------------------------------------------
- 'a':1,6,10 'on':5 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4
+ 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4
 </programlisting>
 A position normally indicates the source word's location in the
@@ -3369,7 +3368,7 @@ SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
 select 'The Fat Rats'::tsvector;
 tsvector
 --------------------
- 'Fat' 'The' 'Rats'
+ 'Fat' 'Rats' 'The'
 </programlisting>
 For most English-text-searching applications the above words would
@@ -3440,6 +3439,19 @@ SELECT 'fat:ab & cat'::tsquery;
 </para>
 <para>
+ Also, lexemes in a <type>tsquery</type> can be labeled with <literal>*</>
+ to specify prefix matching:
+<programlisting>
+SELECT 'super:*'::tsquery;
+ tsquery
+-----------
+ 'super':*
+</programlisting>
+ This query will match any word in a <type>tsvector</> that begins
+ with <quote>super</>.
+ </para>
+
+ <para>
 Quoting rules for lexemes are the same as described above for
 lexemes in <type>tsvector</>; and, as with <type>tsvector</>,
 any required normalization of words must be done before putting
diff --git a/doc/src/sgml/gin.sgml b/doc/src/sgml/gin.sgml
index ad82da6b38e..961451f714a 100644
--- a/doc/src/sgml/gin.sgml
+++ b/doc/src/sgml/gin.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.14 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/gin.sgml,v 2.15 2008/05/16 16:31:01 tgl Exp $ -->
 <chapter id="GIN">
 <title>GIN Indexes</title>
@@ -52,15 +52,15 @@
 </para>
 <para>
- All it takes to get a <acronym>GIN</acronym> access method working
- is to implement four user-defined methods, which define the behavior of
+ All it takes to get a <acronym>GIN</acronym> access method working is to
+ implement four (or five) user-defined methods, which define the behavior of
 keys in the tree and the relationships between keys, indexed values,
 and indexable queries. In short, <acronym>GIN</acronym> combines
 extensibility with generality, code reuse, and a clean interface.
 </para>
 <para>
- The four methods that an index operator class for
+ The four methods that an operator class for
 <acronym>GIN</acronym> must provide are:
 </para>
@@ -77,7 +77,7 @@
 </varlistentry>
 <varlistentry>
- <term>Datum* extractValue(Datum inputValue, int32 *nkeys)</term>
+ <term>Datum *extractValue(Datum inputValue, int32 *nkeys)</term>
 <listitem>
 <para>
 Returns an array of keys given a value to be indexed. The
@@ -87,8 +87,8 @@
 </varlistentry>
 <varlistentry>
- <term>Datum* extractQuery(Datum query, int32 *nkeys,
- StrategyNumber n)</term>
+ <term>Datum *extractQuery(Datum query, int32 *nkeys,
+ StrategyNumber n, bool **pmatch)</term>
 <listitem>
 <para>
 Returns an array of keys given a value to be queried; that is,
@@ -100,13 +100,22 @@
 to consult <literal>n</> to determine the data type of
 <literal>query</> and the key values that need to be extracted.
 The number of returned keys must be stored into <literal>*nkeys</>.
- If number of keys is equal to zero then <function>extractQuery</>
- should store 0 or -1 into <literal>*nkeys</>. 0 means that any
- row matches the <literal>query</> and sequence scan should be
- produced. -1 means nothing can satisfy <literal>query</>.
- Choice of value should be based on semantics meaning of operation with
- given strategy number.
+ If the query contains no keys then <function>extractQuery</>
+ should store 0 or -1 into <literal>*nkeys</>, depending on the
+ semantics of the operator. 0 means that every
+ value matches the <literal>query</> and a sequential scan should be
+ produced. -1 means nothing can match the <literal>query</>.
+ <literal>pmatch</> is an output argument for use when partial match
+ is supported. To use it, <function>extractQuery</> must allocate
+ an array of <literal>*nkeys</> booleans and store its address at
+ <literal>*pmatch</>. Each element of the array should be set to TRUE
+ if the corresponding key requires partial match, FALSE if not.
+ If <literal>*pmatch</> is set to NULL then GIN assumes partial match
+ is not required. The variable is initialized to NULL before call,
+ so this argument can simply be ignored by operator classes that do
+ not support partial match.
 </para>
+
 </listitem>
 </varlistentry>
@@ -133,6 +142,39 @@
 </variablelist>
+ <para>
+ Optionally, an operator class for
+ <acronym>GIN</acronym> can supply a fifth method:
+ </para>
+
+ <variablelist>
+
+ <varlistentry>
+ <term>int comparePartial(Datum partial_key, Datum key, StrategyNumber n)</term>
+ <listitem>
+ <para>
+ Compare a partial-match query to an index key. Returns an integer
+ whose sign indicates the result: less than zero means the index key
+ does not match the query, but the index scan should continue; zero
+ means that the index key does match the query; greater than zero
+ indicates that the index scan should stop because no more matches
+ are possible. The strategy number <literal>n</> of the operator
+ that generated the partial match query is provided, in case its
+ semantics are needed to determine when to end the scan.
+ </para>
+ </listitem>
+ </varlistentry>
+
+ </variablelist>
+
+ <para>
+ To support <quote>partial match</> queries, an operator class must
+ provide the <function>comparePartial</> method, and its
+ <function>extractQuery</> method must set the <literal>pmatch</>
+ parameter when a partial-match query is encountered. See
+ <xref linkend="gin-partial-match"> for details.
+ </para>
+
 </sect1>
 <sect1 id="gin-implementation">
@@ -146,6 +188,33 @@
 list of heap pointers (PL, posting list) if the list is small enough.
 </para>
+ <sect2 id="gin-partial-match">
+ <title>Partial match algorithm</title>
+
+ <para>
+ GIN can support <quote>partial match</> queries, in which the query
+ does not determine an exact match for one or more keys, but the possible
+ matches fall within a reasonably narrow range of key values (within the
+ key sorting order determined by the <function>compare</> support method).
+ The <function>extractQuery</> method, instead of returning a key value
+ to be matched exactly, returns a key value that is the lower bound of
+ the range to be searched, and sets the <literal>pmatch</> flag true.
+ The key range is then searched using the <function>comparePartial</>
+ method. <function>comparePartial</> must return zero for an actual
+ match, less than zero for a non-match that is still within the range
+ to be searched, or greater than zero if the index key is past the range
+ that could match.
+ </para>
+
+ <para>
+ During a partial-match scan, all <literal>itemPointer</>s for matching keys
+ are OR'ed into a <literal>TIDBitmap</>.
+ The scan fails if the <literal>TIDBitmap</> becomes lossy.
+ In this case an error message will be reported with advice
+ to increase <literal>work_mem</>.
+ </para>
+ </sect2>
+
 </sect1>
 <sect1 id="gin-tips">
@@ -236,8 +305,14 @@
 </para>
 <para>
- <acronym>GIN</acronym> searches keys only by equality matching. This might
- be improved in future.
+ It is possible for an operator class to circumvent the restriction against
+ full index scan. To do that, <function>extractValue</> must return at least
+ one (possibly dummy) key for every indexed value, and
+ <function>extractQuery</function> must convert an unrestricted search into
+ a partial-match query that will scan the whole index. This is inefficient
+ but might be necessary to avoid corner-case failures with operators such
+ as LIKE. Note however that failure could still occur if the intermediate
+ <literal>TIDBitmap</> becomes lossy.
 </para>
 </sect1>
@@ -247,9 +322,11 @@
 <para>
 The <productname>PostgreSQL</productname> source distribution includes
 <acronym>GIN</acronym> operator classes for <type>tsvector</> and
- for one-dimensional arrays of all internal types. The following
- <filename>contrib</> modules also contain <acronym>GIN</acronym>
- operator classes:
+ for one-dimensional arrays of all internal types. Prefix searching in
+ <type>tsvector</> is implemented using the <acronym>GIN</> partial match
+ feature.
+ The following <filename>contrib</> modules also contain
+ <acronym>GIN</acronym> operator classes:
 </para>
 <variablelist>
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index caa8847ef8e..41db566b6cc 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.43 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.44 2008/05/16 16:31:01 tgl Exp $ -->
 <chapter id="textsearch">
 <title id="textsearch-title">Full Text Search</title>
@@ -754,6 +754,20 @@ SELECT to_tsquery('english', 'Fat | Rats:AB');
 'fat' | 'rat':AB
 </programlisting>
+ Also, <literal>*</> can be attached to a lexeme to specify prefix matching:
+
+<programlisting>
+SELECT to_tsquery('supern:*A & star:A*B');
+ to_tsquery
+--------------------------
+ 'supern':*A & 'star':*AB
+</programlisting>
+
+ Such a lexeme will match any word in a <type>tsvector</> that begins
+ with the given string.
+ </para>
+
+ <para>
 <function>to_tsquery</function> can also accept single-quoted
 phrases. This is primarily useful when the configuration includes a
 thesaurus dictionary that may trigger on such phrases.
@@ -798,7 +812,8 @@ SELECT to_tsquery('''supernovae stars'' & !crab');
 </programlisting>
 Note that <function>plainto_tsquery</> cannot
- recognize either Boolean operators or weight labels in its input:
+ recognize Boolean operators, weight labels, or prefix-match labels
+ in its input:
 <programlisting>
 SELECT plainto_tsquery('english', 'The Fat & Rats:C');
diff --git a/doc/src/sgml/xindex.sgml b/doc/src/sgml/xindex.sgml
index 6bf7535b636..84b2c9050a1 100644
--- a/doc/src/sgml/xindex.sgml
+++ b/doc/src/sgml/xindex.sgml
@@ -1,4 +1,4 @@
-<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.62 2008/04/14 17:05:32 tgl Exp $ -->
+<!-- $PostgreSQL: pgsql/doc/src/sgml/xindex.sgml,v 1.63 2008/05/16 16:31:01 tgl Exp $ -->
 <sect1 id="xindex">
 <title>Interfacing Extensions To Indexes</title>
@@ -444,6 +444,13 @@
 <entry>consistent - determine whether value matches query condition</entry>
 <entry>4</entry>
 </row>
+ <row>
+ <entry>comparePartial - (optional method) compare partial key from
+ query and key from index, and return an integer less than zero, zero,
+ or greater than zero, indicating whether GIN should ignore this index
+ entry, treat the entry as a match, or stop the index scan</entry>
+ <entry>5</entry>
+ </row>
 </tbody>
 </tgroup>
 </table>
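
Illustration (not part of the commit): the comparePartial contract documented in the gin.sgml changes (less than zero: no match, keep scanning; zero: match; greater than zero: stop the scan) can be sketched in standalone C. This is only a sketch under stated assumptions. It does not use PostgreSQL's Datum or StrategyNumber types, keys are plain C strings ordered by strcmp(), and the names prefix_compare_partial and count_prefix_matches are hypothetical helpers invented here, not functions from the PostgreSQL source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a comparePartial method doing prefix matching.
 * Returns 0 if key starts with partial_key (match), a negative value if key
 * sorts before every possible match (continue scanning), and a positive
 * value if key sorts after every possible match (stop scanning).
 */
static int
prefix_compare_partial(const char *partial_key, const char *key)
{
    size_t plen = strlen(partial_key);
    int    cmp = strncmp(key, partial_key, plen);

    if (cmp == 0)
        return 0;               /* key begins with the prefix */
    return (cmp < 0) ? -1 : 1;  /* before the range, or past it */
}

/*
 * Hypothetical scan loop over keys sorted in strcmp() order. It counts the
 * keys that match the prefix and stops at the first key past the range.
 */
static size_t
count_prefix_matches(const char *const *sorted_keys, size_t nkeys,
                     const char *prefix)
{
    size_t matches = 0;

    assert(sorted_keys != NULL && prefix != NULL);

    for (size_t i = 0; i < nkeys; i++)
    {
        int r = prefix_compare_partial(prefix, sorted_keys[i]);

        if (r > 0)
            break;              /* no later key can match */
        if (r == 0)
            matches++;
        /* r < 0: not a match, but matches may still follow */
    }
    return matches;
}
```

Given the sorted keys "star", "sup", "super", "supernova", "supper", "zebra" and the prefix "super", the loop skips the first two keys, counts two matches, and stops at "supper". According to the patched documentation, a real scan would start at the lower-bound key returned by extractQuery rather than at the first key; the sketch starts from the beginning only to exercise the negative return value.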