author | Tom Lane <tgl@sss.pgh.pa.us> | 2007-09-04 03:46:36 +0000 |
---|---|---|
committer | Tom Lane <tgl@sss.pgh.pa.us> | 2007-09-04 03:46:36 +0000 |
commit | fcc6756341a03350854545c71b5384e804de1209 (patch) | |
tree | 0d78f3aa4fcd5311c196235dcb567ba807481b5e | |
parent | 6d871a2538d55a74034face43dde1f9ceaedc151 (diff) | |
Sync examples of psql \dF output with current CVS HEAD behavior.
Random other wordsmithing.
-rw-r--r-- | doc/src/sgml/textsearch.sgml | 327 |
1 files changed, 185 insertions, 142 deletions
diff --git a/doc/src/sgml/textsearch.sgml b/doc/src/sgml/textsearch.sgml index b165011bc8d..72da3aae259 100644 --- a/doc/src/sgml/textsearch.sgml +++ b/doc/src/sgml/textsearch.sgml @@ -1,7 +1,15 @@ +<!-- $PostgreSQL: pgsql/doc/src/sgml/textsearch.sgml,v 1.16 2007/09/04 03:46:36 tgl Exp $ --> + <chapter id="textsearch"> + <title id="textsearch-title">Full Text Search</title> - <title>Full Text Search</title> + <indexterm zone="textsearch"> + <primary>full text search</primary> + </indexterm> + <indexterm zone="textsearch"> + <primary>text search</primary> + </indexterm> <sect1 id="textsearch-intro"> <title>Introduction</title> @@ -67,43 +75,52 @@ <listitem> <para> <emphasis>Parsing documents into <firstterm>lexemes</></emphasis>. It is - useful to identify various lexemes, e.g. digits, words, complex words, - email addresses, so they can be processed differently. In principle - lexemes depend on the specific application but for an ordinary search it - is useful to have a predefined list of lexemes. <!-- add list of lexemes. - --> + useful to identify various classes of lexemes, e.g. digits, words, + complex words, email addresses, so that they can be processed + differently. In principle lexeme classes depend on the specific + application but for an ordinary search it is useful to have a predefined + set of classes. + <productname>PostgreSQL</productname> uses a <firstterm>parser</> to + perform this step. A standard parser is provided, and custom parsers + can be created for specific needs. </para> </listitem> <listitem> <para> - <emphasis>Dictionaries</emphasis> allow the conversion of lexemes into - a <emphasis>normalized form</emphasis> so it is not necessary to enter - search words in a specific form. + <emphasis>Converting lexemes into <firstterm>normalized + form</></emphasis>. This allows searches to find variant forms of the + same word, without tediously entering all the possible variants. 
+ Also, this step typically eliminates <firstterm>stop words</>, which + are words that are so common that they are useless for searching. + <productname>PostgreSQL</productname> uses <firstterm>dictionaries</> to + perform this step. Various standard dictionaries are provided, and + custom ones can be created for specific needs. </para> </listitem> <listitem> <para> - <emphasis>Store</emphasis> preprocessed documents optimized for - searching. For example, represent each document as a sorted array - of lexemes. Along with lexemes it is desirable to store positional - information to use for <varname>proximity ranking</varname>, so that - a document which contains a more "dense" region of query words is + <emphasis>Storing preprocessed documents optimized for + searching</emphasis>. For example, each document can be represented + as a sorted array of normalized lexemes. Along with the lexemes it is + desirable to store positional information to use for <firstterm>proximity + ranking</firstterm>, so that a document which contains a more + <quote>dense</> region of query words is assigned a higher rank than one with scattered query words. </para> </listitem> </itemizedlist> <para> - Dictionaries allow fine-grained control over how lexemes are created. With - dictionaries you can: + Dictionaries allow fine-grained control over how lexemes are normalized. + With dictionaries you can: </para> <itemizedlist spacing="compact" mark="bullet"> <listitem> <para> - Define "stop words" that should not be indexed. + Define stop words that should not be indexed. </para> </listitem> @@ -135,13 +152,12 @@ </itemizedlist> <para> - A data type (<xref linkend="datatype-textsearch">), <type>tsvector</type> - is provided, for storing preprocessed documents, - along with a type <type>tsquery</type> for representing textual - queries. Also, a full text search operator <literal>@@</literal> is defined - for these data types (<xref linkend="textsearch-searches">). 
Full text - searches can be accelerated using indexes (<xref - linkend="textsearch-indexes">). + A data type <type>tsvector</type> is provided for storing preprocessed + documents, along with a type <type>tsquery</type> for representing processed + queries (<xref linkend="datatype-textsearch">). Also, a full text search + operator <literal>@@</literal> is defined for these data types (<xref + linkend="textsearch-searches">). Full text searches can be accelerated + using indexes (<xref linkend="textsearch-indexes">). </para> @@ -154,20 +170,20 @@ </indexterm> <para> - A document can be a simple text file stored in the file system. The full - text indexing engine can parse text files and store associations of lexemes - (words) with their parent document. Later, these associations are used to - search for documents which contain query words. In this case, the database - can be used to store the full text index and for executing searches, and - some unique identifier can be used to retrieve the document from the file - system. + A <firstterm>document</> is the unit of searching in a full text search + system; for example, a magazine article or email message. The text search + engine must be able to parse documents and store associations of lexemes + (key words) with their parent document. Later, these associations are + used to search for documents which contain query words. </para> <para> - A document can also be any textual database attribute or a combination - (concatenation), which in turn can be stored in various tables or obtained - dynamically. In other words, a document can be constructed from different - parts for indexing and it might not exist as a whole. For example: + For searches within <productname>PostgreSQL</productname>, + a document is normally a textual field within a row of a database table, + or possibly a combination (concatenation) of such fields, perhaps stored + in several tables or obtained dynamically. 
In other words, a document can + be constructed from different parts for indexing and it might not be + stored anywhere as a whole. For example: <programlisting> SELECT title || ' ' || author || ' ' || abstract || ' ' || body AS document @@ -184,10 +200,20 @@ WHERE mid = did AND mid = 12; <para> Actually, in the previous example queries, <literal>COALESCE</literal> <!-- TODO make this a link? --> - should be used to prevent a <literal>NULL</literal> attribute from causing - a <literal>NULL</literal> result. + should be used to prevent a simgle <literal>NULL</literal> attribute from + causing a <literal>NULL</literal> result for the whole document. </para> </note> + + <para> + Another possibility is to store the documents as simple text files in the + file system. In this case, the database can be used to store the full text + index and to execute searches, and some unique identifier can be used to + retrieve the document from the file system. However, retrieving files + from outside the database requires superuser permissions or special + function support, so this is usually less convenient than keeping all + the data inside <productname>PostgreSQL</productname>. + </para> </sect2> <sect2 id="textsearch-searches"> @@ -261,8 +287,9 @@ SELECT 'fat & cow'::tsquery @@ 'a fat cat sat on a mat and ate a fat rat'::t <xref linkend="guc-default-text-search-config"> was set accordingly in <filename>postgresql.conf</>. If you are using the same text search configuration for the entire cluster you can use the value in - <filename>postgresql.conf</>. If using different configurations but - the same text search configuration for an entire database, + <filename>postgresql.conf</>. If using different configurations + throughout the cluster but + the same text search configuration for any one database, use <command>ALTER DATABASE ... SET</>. If not, you must set <varname> default_text_search_config</varname> in each session. Many functions also take an optional configuration name. 
@@ -555,7 +582,7 @@ UPDATE tt SET ti= <term> <synopsis> - ts_parse(<replaceable class="PARAMETER">parser</replaceable>, <replaceable class="PARAMETER">document</replaceable> TEXT) returns SETOF <type>tokenout</type> + ts_parse(<replaceable class="PARAMETER">parser</replaceable>, <replaceable class="PARAMETER">document</replaceable> text, OUT <replaceable class="PARAMETER">tokid</> integer, OUT <replaceable class="PARAMETER">token</> text) returns SETOF RECORD </synopsis> </term> @@ -588,7 +615,7 @@ SELECT * FROM ts_parse('default','123 - a number'); <term> <synopsis> - ts_token_type(<replaceable class="PARAMETER">parser</replaceable> ) returns SETOF <type>tokentype</type> + ts_token_type(<replaceable class="PARAMETER">parser</>, OUT <replaceable class="PARAMETER">tokid</> integer, OUT <replaceable class="PARAMETER">alias</> text, OUT <replaceable class="PARAMETER">description</> text) returns SETOF RECORD </synopsis> </term> @@ -1107,20 +1134,20 @@ SELECT ts_lexize('english_stem', 'stars'); (1 row) </programlisting> - Also, the <function>ts_debug</function> function (<xref linkend="textsearch-debugging">) - can be used for this. + Also, the <function>ts_debug</function> function (<xref + linkend="textsearch-debugging">) is helpful for testing. </para> <sect2 id="textsearch-stopwords"> <title>Stop Words</title> <para> - Stop words are words which are very common, appear in almost - every document, and have no discrimination value. Therefore, they can be ignored - in the context of full text searching. For example, every English text contains - words like <literal>a</literal> although it is useless to store them in an index. - However, stop words do affect the positions in <type>tsvector</type>, - which in turn, do affect ranking: + Stop words are words which are very common, appear in almost every + document, and have no discrimination value. Therefore, they can be ignored + in the context of full text searching. 
For example, every English text + contains words like <literal>a</literal> and <literal>the</>, so it is + useless to store them in an index. However, stop words do affect the + positions in <type>tsvector</type>, which in turn affect ranking: <programlisting> SELECT to_tsvector('english','in the list of stop words'); @@ -1542,11 +1569,15 @@ SELECT ts_lexize('norwegian_ispell','sjokoladefabrikk'); <para> The <application>Snowball</> dictionary template is based on the project of Martin Porter, inventor of the popular Porter's stemming algorithm - for the English language and now supported in many languages (see the <ulink - url="http://snowball.tartarus.org">Snowball site</ulink> for more - information). The Snowball project supplies a large number of stemmers for - many languages. A Snowball dictionary requires a language parameter to - identify which stemmer to use, and optionally can specify a stopword file name. + for the English language. Snowball now provides stemming algorithms for + many languages (see the <ulink url="http://snowball.tartarus.org">Snowball + site</ulink> for more information). Each algorithm understands how to + reduce common variant forms of words to a base, or stem, spelling within + its language. A Snowball dictionary requires a language parameter to + identify which stemmer to use, and optionally can specify a stopword file + name that gives a list of words to eliminate. + (<productname>PostgreSQL</productname>'s standard stopword lists are also + provided by the Snowball project.) For example, there is a built-in definition equivalent to <programlisting> @@ -1782,7 +1813,7 @@ version of our software: PostgreSQL 8.3. 
<programlisting> => \dF - List of fulltext configurations + List of text search configurations Schema | Name | Description ---------+------+------------- public | pg | @@ -2053,24 +2084,24 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); <para> Information about full text searching objects can be obtained - in <literal>psql</literal> using a set of commands: + in <application>psql</application> using a set of commands: <synopsis> - \dF{,d,p}<optional>+</optional> <optional>PATTERN</optional> + \dF{d,p,t}<optional>+</optional> <optional>PATTERN</optional> </synopsis> An optional <literal>+</literal> produces more details. </para> <para> The optional parameter <literal>PATTERN</literal> should be the name of - a full text searching object, optionally schema-qualified. If + a text searching object, optionally schema-qualified. If <literal>PATTERN</literal> is not specified then information about all - visible objects will be displayed. <literal>PATTERN</literal> can be a - regular expression and can apply <emphasis>separately</emphasis> to schema - names and object names. The following examples illustrate this: + visible objects will be displayed. <literal>PATTERN</literal> can be a + regular expression and can provide <emphasis>separate</emphasis> patterns + for the schema and object names. 
The following examples illustrate this: <programlisting> => \dF *fulltext* - List of fulltext configurations + List of text search configurations Schema | Name | Description --------+--------------+------------- public | fulltext_cfg | @@ -2078,7 +2109,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); <programlisting> => \dF *.fulltext* - List of fulltext configurations + List of text search configurations Schema | Name | Description ----------+---------------------------- fulltext | fulltext_cfg | @@ -2093,46 +2124,42 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); <listitem> <para> - List full text searching configurations (add "+" for more detail) - </para> - <para> - By default (without <literal>PATTERN</literal>), information about - all <emphasis>visible</emphasis> full text configurations will be - displayed. + List text searching configurations (add <literal>+</> for more detail). </para> + <para> <programlisting> => \dF russian - List of fulltext configurations - Schema | Name | Description -------------+---------+----------------------------------- - pg_catalog | russian | default configuration for Russian + List of text search configurations + Schema | Name | Description +------------+---------+------------------------------------ + pg_catalog | russian | configuration for russian language => \dF+ russian - Configuration "pg_catalog.russian" - Parser name: "pg_catalog.default" - Token | Dictionaries ---------------+------------------------- - email | pg_catalog.simple - file | pg_catalog.simple - float | pg_catalog.simple - host | pg_catalog.simple - hword | pg_catalog.russian_stem - int | pg_catalog.simple - lhword | public.tz_simple - lpart_hword | public.tz_simple - lword | public.tz_simple - nlhword | pg_catalog.russian_stem - nlpart_hword | pg_catalog.russian_stem - nlword | pg_catalog.russian_stem - part_hword | pg_catalog.simple - sfloat | pg_catalog.simple - uint | pg_catalog.simple - 
uri | pg_catalog.simple - url | pg_catalog.simple - version | pg_catalog.simple - word | pg_catalog.russian_stem +Text search configuration "pg_catalog.russian" +Parser: "pg_catalog.default" + Token | Dictionaries +--------------+-------------- + email | simple + file | simple + float | simple + host | simple + hword | russian_stem + int | simple + lhword | english_stem + lpart_hword | english_stem + lword | english_stem + nlhword | russian_stem + nlpart_hword | russian_stem + nlword | russian_stem + part_hword | russian_stem + sfloat | simple + uint | simple + uri | simple + url | simple + version | simple + word | russian_stem </programlisting> </para> </listitem> @@ -2142,35 +2169,31 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); <term>\dFd[+] [PATTERN]</term> <listitem> <para> - List full text dictionaries (add "+" for more detail). - </para> - <para> - By default (without <literal>PATTERN</literal>), information about - all <emphasis>visible</emphasis> dictionaries will be displayed. + List text search dictionaries (add <literal>+</> for more detail). 
</para> <para> <programlisting> => \dFd - List of fulltext dictionaries - Schema | Name | Description -------------+------------+----------------------------------------------------------- - pg_catalog | danish | Snowball stemmer for danish language - pg_catalog | dutch | Snowball stemmer for dutch language - pg_catalog | english | Snowball stemmer for english language - pg_catalog | finnish | Snowball stemmer for finnish language - pg_catalog | french | Snowball stemmer for french language - pg_catalog | german | Snowball stemmer for german language - pg_catalog | hungarian | Snowball stemmer for hungarian language - pg_catalog | italian | Snowball stemmer for italian language - pg_catalog | norwegian | Snowball stemmer for norwegian language - pg_catalog | portuguese | Snowball stemmer for portuguese language - pg_catalog | romanian | Snowball stemmer for romanian language - pg_catalog | russian | Snowball stemmer for russian language - pg_catalog | simple | simple dictionary: just lower case and check for stopword - pg_catalog | spanish | Snowball stemmer for spanish language - pg_catalog | swedish | Snowball stemmer for swedish language - pg_catalog | turkish | Snowball stemmer for turkish language + List of text search dictionaries + Schema | Name | Description +------------+-----------------+----------------------------------------------------------- + pg_catalog | danish_stem | snowball stemmer for danish language + pg_catalog | dutch_stem | snowball stemmer for dutch language + pg_catalog | english_stem | snowball stemmer for english language + pg_catalog | finnish_stem | snowball stemmer for finnish language + pg_catalog | french_stem | snowball stemmer for french language + pg_catalog | german_stem | snowball stemmer for german language + pg_catalog | hungarian_stem | snowball stemmer for hungarian language + pg_catalog | italian_stem | snowball stemmer for italian language + pg_catalog | norwegian_stem | snowball stemmer for norwegian language + 
pg_catalog | portuguese_stem | snowball stemmer for portuguese language + pg_catalog | romanian_stem | snowball stemmer for romanian language + pg_catalog | russian_stem | snowball stemmer for russian language + pg_catalog | simple | simple dictionary: just lower case and check for stopword + pg_catalog | spanish_stem | snowball stemmer for spanish language + pg_catalog | swedish_stem | snowball stemmer for swedish language + pg_catalog | turkish_stem | snowball stemmer for turkish language </programlisting> </para> </listitem> @@ -2181,32 +2204,28 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); <term>\dFp[+] [PATTERN]</term> <listitem> <para> - List full text parsers (add "+" for more detail) - </para> - <para> - By default (without <literal>PATTERN</literal>), information about - all <emphasis>visible</emphasis> full text parsers will be displayed. + List text search parsers (add <literal>+</> for more detail). </para> + <para> <programlisting> - => \dFp - List of fulltext parsers - Schema | Name | Description +=> \dFp + List of text search parsers + Schema | Name | Description ------------+---------+--------------------- pg_catalog | default | default word parser - (1 row) => \dFp+ - Fulltext parser "pg_catalog.default" - Method | Function | Description --------------------+---------------------------+------------- - Start parse | pg_catalog.prsd_start | - Get next token | pg_catalog.prsd_nexttoken | - End parse | pg_catalog.prsd_end | - Get headline | pg_catalog.prsd_headline | - Get lexeme's type | pg_catalog.prsd_lextype | - - Token's types for parser "pg_catalog.default" - Token name | Description + Text search parser "pg_catalog.default" + Method | Function | Description +------------------+----------------+------------- + Start parse | prsd_start | + Get next token | prsd_nexttoken | + End parse | prsd_end | + Get headline | prsd_headline | + Get lexeme types | prsd_lextype | + + Token types for parser "pg_catalog.default" + 
Token name | Description --------------+----------------------------------- blank | Space symbols email | Email @@ -2237,6 +2256,30 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); </listitem> </varlistentry> + <varlistentry> + + <term>\dFt[+] [PATTERN]</term> + <listitem> + <para> + List text search templates (add <literal>+</> for more detail). + </para> + + <para> +<programlisting> +=> \dFt + List of text search templates + Schema | Name | Description +------------+-----------+----------------------------------------------------------- + pg_catalog | ispell | ispell dictionary + pg_catalog | simple | simple dictionary: just lower case and check for stopword + pg_catalog | snowball | snowball stemmer + pg_catalog | synonym | synonym dictionary: replace word by its synonym + pg_catalog | thesaurus | thesaurus dictionary: phrase by phrase substitution +</programlisting> + </para> + </listitem> + </varlistentry> + </variablelist> </sect1> @@ -2261,7 +2304,7 @@ EXPLAIN SELECT * FROM apod WHERE textsearch @@@ to_tsquery('supernovae:a'); </para> <para> - <replaceable class="PARAMETER">ts_debug</replaceable> type defined as: + <replaceable class="PARAMETER">ts_debug</replaceable>'s result type is defined as: <programlisting> CREATE TYPE ts_debug AS ( @@ -2297,7 +2340,7 @@ ALTER TEXT SEARCH CONFIGURATION public.english <programlisting> SELECT * FROM ts_debug('public.english','The Brightest supernovaes'); - Alias | Description | Token | Dicts list | Lexized token + Alias | Description | Token | Dictionaries | Lexized token -------+---------------+-------------+---------------------------------------+--------------------------------- lword | Latin word | The | {public.english_ispell,pg_catalog.english_stem} | public.english_ispell: {} blank | Space symbols | | | |