author: Tom Lane <tgl@sss.pgh.pa.us> 2007-09-04 03:46:36 +0000
committer: Tom Lane <tgl@sss.pgh.pa.us> 2007-09-04 03:46:36 +0000
commit: fcc6756341a03350854545c71b5384e804de1209
tree: 0d78f3aa4fcd5311c196235dcb567ba807481b5e
parent: 6d871a2538d55a74034face43dde1f9ceaedc151
Sync examples of psql \dF output with current CVS HEAD behavior.
Random other wordsmithing.
-rw-r--r-- doc/src/sgml/textsearch.sgml 327
1 files changed, 185 insertions, 142 deletions
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml
index b165011bc8d..72da3aae259 100644
--- a/doc/src/sgml/textsearch.sgml
+++ b/doc/src/sgml/textsearch.sgml
@@ -1,7 +1,15 @@
+<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.16 2007/09/04 03:46:36 tgl Exp $ -->
+
<chapter id="textsearch">
+ <title id="textsearch-title">Full Text Search</title>
- <title>Full Text Search</title>
+ <indexterm zone="textsearch">
+ <primary>full text search</primary>
+ </indexterm>
+ <indexterm zone="textsearch">
+ <primary>text search</primary>
+ </indexterm>
<sect1 id="textsearch-intro">
<title>Introduction</title>
@@ -67,43 +75,52 @@
<listitem>
<para>
<emphasis>Parsing documents into <firstterm>lexemes</></emphasis>. It is
- useful to identify various lexemes, e.g. digits, words, complex words,
- email addresses, so they can be processed differently. In principle
- lexemes depend on the specific application but for an ordinary search it
- is useful to have a predefined list of lexemes. <!-- add list of lexemes.
- -->
+ useful to identify various classes of lexemes, e.g. digits, words,
+ complex words, email addresses, so that they can be processed
+ differently. In principle lexeme classes depend on the specific
+ application but for an ordinary search it is useful to have a predefined
+ set of classes.
+ <productname>PostgreSQL</productname> uses a <firstterm>parser</> to
+ perform this step. A standard parser is provided, and custom parsers
+ can be created for specific needs.
</para>
</listitem>
<listitem>
<para>
- <emphasis>Dictionaries</emphasis> allow the conversion of lexemes into
- a <emphasis>normalized form</emphasis> so it is not necessary to enter
- search words in a specific form.
+ <emphasis>Converting lexemes into <firstterm>normalized
+ form</></emphasis>. This allows searches to find variant forms of the
+ same word, without tediously entering all the possible variants.
+ Also, this step typically eliminates <firstterm>stop words</>, which
+ are words that are so common that they are useless for searching.
+ <productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to
+ perform this step. Various standard dictionaries are provided, and
+ custom ones can be created for specific needs.
</para>
</listitem>
<listitem>
<para>
- <emphasis>Store</emphasis> preprocessed documents optimized for
- searching. For example, represent each document as a sorted array
- of lexemes. Along with lexemes it is desirable to store positional
- information to use for <varname>proximity ranking</varname>, so that
- a document which contains a more "dense" region of query words is
+ <emphasis>Storing preprocessed documents optimized for
+ searching</emphasis>. For example, each document can be represented
+ as a sorted array of normalized lexemes. Along with the lexemes it is
+ desirable to store positional information to use for <firstterm>proximity
+ ranking</firstterm>, so that a document which contains a more
+ <quote>dense</> region of query words is
assigned a higher rank than one with scattered query words.
</para>
</listitem>
</itemizedlist>
<para>
- Dictionaries allow fine-grained control over how lexemes are created. With
- dictionaries you can:
+ Dictionaries allow fine-grained control over how lexemes are normalized.
+ With dictionaries you can:
</para>
<itemizedlist spacing="compact" mark="bullet">
<listitem>
<para>
- Define "stop words" that should not be indexed.
+ Define stop words that should not be indexed.
</para>
</listitem>
@@ -135,13 +152,12 @@
</itemizedlist>
<para>
- A data type (<xref linkend="datatype-textsearch">), <type>tsvector</type>
- is provided, for storing preprocessed documents,
- along with a type <type>tsquery</type> for representing textual
- queries. Also, a full text search operator <literal>@@</literal> is defined
- for these data types (<xref linkend="textsearch-searches">). Full text
- searches can be accelerated using indexes (<xref
- linkend="textsearch-indexes">).
+ A data type <type>tsvector</type> is provided for storing preprocessed
+ documents, along with a type <type>tsquery</type> for representing processed
+ queries (<xref linkend="datatype-textsearch">). Also, a full text search
+ operator <literal>@@</literal> is defined for these data types (<xref
+ linkend="textsearch-searches">). Full text searches can be accelerated
+ using indexes (<xref linkend="textsearch-indexes">).
</para>
@@ -154,20 +170,20 @@
</indexterm>
<para>
- A document can be a simple text file stored in the file system. The full
- text indexing engine can parse text files and store associations of lexemes
- (words) with their parent document. Later, these associations are used to
- search for documents which contain query words. In this case, the database
- can be used to store the full text index and for executing searches, and
- some unique identifier can be used to retrieve the document from the file
- system.
+ A <firstterm>document</> is the unit of searching in a full text search
+ system; for example, a magazine article or email message. The text search
+ engine must be able to parse documents and store associations of lexemes
+ (key words) with their parent document. Later, these associations are
+ used to search for documents which contain query words.
</para>
<para>
- A document can also be any textual database attribute or a combination
- (concatenation), which in turn can be stored in various tables or obtained
- dynamically. In other words, a document can be constructed from different
- parts for indexing and it might not exist as a whole. For example:
+ For searches within <productname>PostgreSQL</productname>,
+ a document is normally a textual field within a row of a database table,
+ or possibly a combination (concatenation) of such fields, perhaps stored
+ in several tables or obtained dynamically. In other words, a document can
+ be constructed from different parts for indexing and it might not be
+ stored anywhere as a whole. For example:
<programlisting>
SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document
@@ -184,10 +200,20 @@ WHERE mid = did AND mid = 12;
<para>
Actually, in the previous example queries, <literal>COALESCE</literal>
<!-- TODO make this a link? -->
- should be used to prevent a <literal>NULL</literal> attribute from causing
- a <literal>NULL</literal> result.
+ should be used to prevent a single <literal>NULL</literal> attribute from
+ causing a <literal>NULL</literal> result for the whole document.
</para>
</note>
+
+ <para>
+ Another possibility is to store the documents as simple text files in the
+ file system. In this case, the database can be used to store the full text
+ index and to execute searches, and some unique identifier can be used to
+ retrieve the document from the file system. However, retrieving files
+ from outside the database requires superuser permissions or special
+ function support, so this is usually less convenient than keeping all
+ the data inside <productname>PostgreSQL</productname>.
+ </para>
</sect2>
<sect2 id="textsearch-searches">
@@ -261,8 +287,9 @@ SELECT 'fat &amp; cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t
<xref linkend="guc-default-text-search-config"> was set accordingly
in <filename>postgresql.conf</>. If you are using the same text search
configuration for the entire cluster you can use the value in
- <filename>postgresql.conf</>. If using different configurations but
- the same text search configuration for an entire database,
+ <filename>postgresql.conf</>. If using different configurations
+ throughout the cluster but
+ the same text search configuration for any one database,
use <command>ALTER DATABASE ... SET</>. If not, you must set <varname>
default_text_search_config</varname> in each session. Many functions
also take an optional configuration name.
@@ -555,7 +582,7 @@ UPDATE tt SET ti=
<term>
<synopsis>
- ts_parse(<replaceable class="PARAMETER">parser</replaceable>, <replaceable class="PARAMETER">document</replaceable> TEXT) returns SETOF <type>tokenout</type>
+ ts_parse(<replaceable class="PARAMETER">parser</replaceable>, <replaceable class="PARAMETER">document</replaceable> text, OUT <replaceable class="PARAMETER">tokid</> integer, OUT <replaceable class="PARAMETER">token</> text) returns SETOF RECORD
</synopsis>
</term>
@@ -588,7 +615,7 @@ SELECT * FROM ts_parse('default','123 - a number');
<term>
<synopsis>
- ts_token_type(<replaceable class="PARAMETER">parser</replaceable> ) returns SETOF <type>tokentype</type>
+ ts_token_type(<replaceable class="PARAMETER">parser</>, OUT <replaceable class="PARAMETER">tokid</> integer, OUT <replaceable class="PARAMETER">alias</> text, OUT <replaceable class="PARAMETER">description</> text) returns SETOF RECORD
</synopsis>
</term>
@@ -1107,20 +1134,20 @@ SELECT ts_lexize('english_stem', 'stars');
(1 row)
</programlisting>
- Also, the <function>ts_debug</function> function (<xref linkend="textsearch-debugging">)
- can be used for this.
+ Also, the <function>ts_debug</function> function (<xref
+ linkend="textsearch-debugging">) is helpful for testing.
</para>
<sect2 id="textsearch-stopwords">
<title>Stop Words</title>
<para>
- Stop words are words which are very common, appear in almost
- every document, and have no discrimination value. Therefore, they can be ignored
- in the context of full text searching. For example, every English text contains
- words like <literal>a</literal> although it is useless to store them in an index.
- However, stop words do affect the positions in <type>tsvector</type>,
- which in turn, do affect ranking:
+ Stop words are words which are very common, appear in almost every
+ document, and have no discrimination value. Therefore, they can be ignored
+ in the context of full text searching. For example, every English text
+ contains words like <literal>a</literal> and <literal>the</>, so it is
+ useless to store them in an index. However, stop words do affect the
+ positions in <type>tsvector</type>, which in turn affect ranking:
<programlisting>
SELECT to_tsvector('english','in the list of stop words');
@@ -1542,11 +1569,15 @@ SELECT ts_lexize('norwegian_ispell','sjokoladefabrikk');
<para>
The <application>Snowball</> dictionary template is based on the project
of Martin Porter, inventor of the popular Porter's stemming algorithm
- for the English language and now supported in many languages (see the <ulink
- url="http://snowball.tartarus.org">Snowball site</ulink> for more
- information). The Snowball project supplies a large number of stemmers for
- many languages. A Snowball dictionary requires a language parameter to
- identify which stemmer to use, and optionally can specify a stopword file name.
+ for the English language. Snowball now provides stemming algorithms for
+ many languages (see the <ulink url="http://snowball.tartarus.org">Snowball
+ site</ulink> for more information). Each algorithm understands how to
+ reduce common variant forms of words to a base, or stem, spelling within
+ its language. A Snowball dictionary requires a language parameter to
+ identify which stemmer to use, and optionally can specify a stopword file
+ name that gives a list of words to eliminate.
+ (<productname>PostgreSQL</productname>'s standard stopword lists are also
+ provided by the Snowball project.)
For example, there is a built-in definition equivalent to
<programlisting>
@@ -1782,7 +1813,7 @@ version of our software: PostgreSQL 8.3.
<programlisting>
=&gt; \dF
- List of fulltext configurations
+ List of text search configurations
Schema | Name | Description
---------+------+-------------
public | pg |
@@ -2053,24 +2084,24 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<para>
Information about full text searching objects can be obtained
- in <literal>psql</literal> using a set of commands:
+ in <application>psql</application> using a set of commands:
<synopsis>
- \dF{,d,p}<optional>+</optional> <optional>PATTERN</optional>
+ \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional>
</synopsis>
An optional <literal>+</literal> produces more details.
</para>
<para>
The optional parameter <literal>PATTERN</literal> should be the name of
- a full text searching object, optionally schema-qualified. If
+ a text searching object, optionally schema-qualified. If
<literal>PATTERN</literal> is not specified then information about all
- visible objects will be displayed. <literal>PATTERN</literal> can be a
- regular expression and can apply <emphasis>separately</emphasis> to schema
- names and object names. The following examples illustrate this:
+ visible objects will be displayed. <literal>PATTERN</literal> can be a
+ regular expression and can provide <emphasis>separate</emphasis> patterns
+ for the schema and object names. The following examples illustrate this:
<programlisting>
=&gt; \dF *fulltext*
- List of fulltext configurations
+ List of text search configurations
Schema | Name | Description
--------+--------------+-------------
public | fulltext_cfg |
@@ -2078,7 +2109,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<programlisting>
=&gt; \dF *.fulltext*
- List of fulltext configurations
+ List of text search configurations
Schema | Name | Description
----------+----------------------------
fulltext | fulltext_cfg |
@@ -2093,46 +2124,42 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<listitem>
<para>
- List full text searching configurations (add "+" for more detail)
- </para>
- <para>
- By default (without <literal>PATTERN</literal>), information about
- all <emphasis>visible</emphasis> full text configurations will be
- displayed.
+ List text searching configurations (add <literal>+</> for more detail).
</para>
+
<para>
<programlisting>
=&gt; \dF russian
- List of fulltext configurations
- Schema | Name | Description
-------------+---------+-----------------------------------
- pg_catalog | russian | default configuration for Russian
+ List of text search configurations
+ Schema | Name | Description
+------------+---------+------------------------------------
+ pg_catalog | russian | configuration for russian language
=&gt; \dF+ russian
- Configuration "pg_catalog.russian"
- Parser name: "pg_catalog.default"
- Token | Dictionaries
---------------+-------------------------
- email | pg_catalog.simple
- file | pg_catalog.simple
- float | pg_catalog.simple
- host | pg_catalog.simple
- hword | pg_catalog.russian_stem
- int | pg_catalog.simple
- lhword | public.tz_simple
- lpart_hword | public.tz_simple
- lword | public.tz_simple
- nlhword | pg_catalog.russian_stem
- nlpart_hword | pg_catalog.russian_stem
- nlword | pg_catalog.russian_stem
- part_hword | pg_catalog.simple
- sfloat | pg_catalog.simple
- uint | pg_catalog.simple
- uri | pg_catalog.simple
- url | pg_catalog.simple
- version | pg_catalog.simple
- word | pg_catalog.russian_stem
+Text search configuration "pg_catalog.russian"
+Parser: "pg_catalog.default"
+ Token | Dictionaries
+--------------+--------------
+ email | simple
+ file | simple
+ float | simple
+ host | simple
+ hword | russian_stem
+ int | simple
+ lhword | english_stem
+ lpart_hword | english_stem
+ lword | english_stem
+ nlhword | russian_stem
+ nlpart_hword | russian_stem
+ nlword | russian_stem
+ part_hword | russian_stem
+ sfloat | simple
+ uint | simple
+ uri | simple
+ url | simple
+ version | simple
+ word | russian_stem
</programlisting>
</para>
</listitem>
@@ -2142,35 +2169,31 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<term>\dFd[+] [PATTERN]</term>
<listitem>
<para>
- List full text dictionaries (add "+" for more detail).
- </para>
- <para>
- By default (without <literal>PATTERN</literal>), information about
- all <emphasis>visible</emphasis> dictionaries will be displayed.
+ List text search dictionaries (add <literal>+</> for more detail).
</para>
<para>
<programlisting>
=&gt; \dFd
- List of fulltext dictionaries
- Schema | Name | Description
-------------+------------+-----------------------------------------------------------
- pg_catalog | danish | Snowball stemmer for danish language
- pg_catalog | dutch | Snowball stemmer for dutch language
- pg_catalog | english | Snowball stemmer for english language
- pg_catalog | finnish | Snowball stemmer for finnish language
- pg_catalog | french | Snowball stemmer for french language
- pg_catalog | german | Snowball stemmer for german language
- pg_catalog | hungarian | Snowball stemmer for hungarian language
- pg_catalog | italian | Snowball stemmer for italian language
- pg_catalog | norwegian | Snowball stemmer for norwegian language
- pg_catalog | portuguese | Snowball stemmer for portuguese language
- pg_catalog | romanian | Snowball stemmer for romanian language
- pg_catalog | russian | Snowball stemmer for russian language
- pg_catalog | simple | simple dictionary: just lower case and check for stopword
- pg_catalog | spanish | Snowball stemmer for spanish language
- pg_catalog | swedish | Snowball stemmer for swedish language
- pg_catalog | turkish | Snowball stemmer for turkish language
+ List of text search dictionaries
+ Schema | Name | Description
+------------+-----------------+-----------------------------------------------------------
+ pg_catalog | danish_stem | snowball stemmer for danish language
+ pg_catalog | dutch_stem | snowball stemmer for dutch language
+ pg_catalog | english_stem | snowball stemmer for english language
+ pg_catalog | finnish_stem | snowball stemmer for finnish language
+ pg_catalog | french_stem | snowball stemmer for french language
+ pg_catalog | german_stem | snowball stemmer for german language
+ pg_catalog | hungarian_stem | snowball stemmer for hungarian language
+ pg_catalog | italian_stem | snowball stemmer for italian language
+ pg_catalog | norwegian_stem | snowball stemmer for norwegian language
+ pg_catalog | portuguese_stem | snowball stemmer for portuguese language
+ pg_catalog | romanian_stem | snowball stemmer for romanian language
+ pg_catalog | russian_stem | snowball stemmer for russian language
+ pg_catalog | simple | simple dictionary: just lower case and check for stopword
+ pg_catalog | spanish_stem | snowball stemmer for spanish language
+ pg_catalog | swedish_stem | snowball stemmer for swedish language
+ pg_catalog | turkish_stem | snowball stemmer for turkish language
</programlisting>
</para>
</listitem>
@@ -2181,32 +2204,28 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
<term>\dFp[+] [PATTERN]</term>
<listitem>
<para>
- List full text parsers (add "+" for more detail)
- </para>
- <para>
- By default (without <literal>PATTERN</literal>), information about
- all <emphasis>visible</emphasis> full text parsers will be displayed.
+ List text search parsers (add <literal>+</> for more detail).
</para>
+
<para>
<programlisting>
- =&gt; \dFp
- List of fulltext parsers
- Schema | Name | Description
+=&gt; \dFp
+ List of text search parsers
+ Schema | Name | Description
------------+---------+---------------------
pg_catalog | default | default word parser
- (1 row)
=&gt; \dFp+
- Fulltext parser "pg_catalog.default"
- Method | Function | Description
--------------------+---------------------------+-------------
- Start parse | pg_catalog.prsd_start |
- Get next token | pg_catalog.prsd_nexttoken |
- End parse | pg_catalog.prsd_end |
- Get headline | pg_catalog.prsd_headline |
- Get lexeme's type | pg_catalog.prsd_lextype |
-
- Token's types for parser "pg_catalog.default"
- Token name | Description
+ Text search parser "pg_catalog.default"
+ Method | Function | Description
+------------------+----------------+-------------
+ Start parse | prsd_start |
+ Get next token | prsd_nexttoken |
+ End parse | prsd_end |
+ Get headline | prsd_headline |
+ Get lexeme types | prsd_lextype |
+
+ Token types for parser "pg_catalog.default"
+ Token name | Description
--------------+-----------------------------------
blank | Space symbols
email | Email
@@ -2237,6 +2256,30 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
</listitem>
</varlistentry>
+ <varlistentry>
+
+ <term>\dFt[+] [PATTERN]</term>
+ <listitem>
+ <para>
+ List text search templates (add <literal>+</> for more detail).
+ </para>
+
+ <para>
+<programlisting>
+=&gt; \dFt
+ List of text search templates
+ Schema | Name | Description
+------------+-----------+-----------------------------------------------------------
+ pg_catalog | ispell | ispell dictionary
+ pg_catalog | simple | simple dictionary: just lower case and check for stopword
+ pg_catalog | snowball | snowball stemmer
+ pg_catalog | synonym | synonym dictionary: replace word by its synonym
+ pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution
+</programlisting>
+ </para>
+ </listitem>
+ </varlistentry>
+
</variablelist>
</sect1>
@@ -2261,7 +2304,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a');
</para>
<para>
- <replaceable class="PARAMETER">ts_debug</replaceable> type defined as:
+ <replaceable class="PARAMETER">ts_debug</replaceable>'s result type is defined as:
<programlisting>
CREATE TYPE ts_debug AS (
@@ -2297,7 +2340,7 @@ ALTER TEXT SEARCH CONFIGURATION public.english
<programlisting>
SELECT * FROM ts_debug('public.english','The Brightest supernovaes');
- Alias | Description | Token | Dicts list | Lexized token
+ Alias | Description | Token | Dictionaries | Lexized token
-------+---------------+-------------+---------------------------------------+---------------------------------
lword | Latin word | The | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {}
blank | Space symbols | | |