-rw-r--r-- contrib/tsearch2/README.tsearch2 | 333
1 files changed, 167 insertions, 166 deletions
diff --git a/contrib/tsearch2/README.tsearch2 b/contrib/tsearch2/README.tsearch2
index c3581b9be6b..890868e59af 100644
--- a/contrib/tsearch2/README.tsearch2
+++ b/contrib/tsearch2/README.tsearch2
@@ -1,95 +1,106 @@
Tsearch2 - full text search extension for PostgreSQL
- [10][Online version] of this document is available
-
- This module is sponsored by Delta-Soft Ltd., Moscow, Russia.
-
- Notice: This version is fully incompatible with old tsearch (V1),
- which was deprecated in 7.4 and obsoleted in 8.0.
-
- The Tsearch2 contrib module contains an implementation of a new data
- type tsvector - a searchable data type with indexed access. In a
- nutshell, tsvector is a set of unique words along with their
- positional information in the document, organized in a special
- structure optimized for fast access and lookup. Actually, each word
- entry, besides its position in the document, could have a weight
- attribute, describing importance of this word (at a specific) position
- in document. A set of bit-signatures of a fixed length, representing
- tsvectors, are stored in a search tree (developed using PostgreSQL
- GiST), which provides online update of full text index and fast query
- lookup. The module provides indexed access methods, queries,
- operations and supporting routines for the tsvector data type and easy
- conversion of text data to tsvector. Table driven configuration allows
- creation of custom configuration optimized for specific searches using
+ [1]Online version of this document is available
+
+ Tsearch2 is a full text search engine, fully integrated into the
+ PostgreSQL RDBMS.
+
+Main features
+
+ * Full online update
+ * Supports multiple table driven configurations
+ * flexible and rich linguistic support (dictionaries, stop words),
+ thesaurus
+ * full multibyte (UTF-8) support
+ * Sophisticated ranking functions with support of proximity and
+ structure information (rank, rank_cd)
+ * Index support (GiST and Gin) with concurrency and recovery support
+ * Rich query language with query rewriting support
+ * Headline support (text fragments with highlighted search terms)
+ * Ability to plug-in custom dictionaries and parsers
+ * Template generator for tsearch2 dictionaries with [2]snowball
+ stemmer support
+ * It is mature (5 years of development)
+
+ Tsearch2, in a nutshell, provides an FTS operator (contains) for two new
+ data types, representing a document (tsvector) and a query (tsquery).
+ Table driven configuration allows creation of custom searches using
standard SQL commands.
-
- Configuration allows you to:
- * specify the type of lexemes to be indexed and the way they are
- processed.
- * specify dictionaries to be used along with stop words recognition.
- * specify the parser used to process a document.
-
- See [11]Documentation Roadmap for links to documentation.
+
+ tsvector is a searchable data type, representing a document. It is a set
+ of unique words along with their positional information in the
+ document, organized in a special structure optimized for fast access
+ and lookup. Each entry can be labelled to reflect its importance in
+ the document.
+
+ tsquery is a data type for textual queries with support for boolean
+ operators. It consists of lexemes (optionally labelled) with boolean
+ operators between them.
+
+ Table driven configuration allows you to specify:
+ * the parser used to break a document into lexemes
+ * which lexemes to index and the way they are processed
+ * the dictionaries to be used along with stop words recognition.
OpenFTS vs Tsearch2
- OpenFTS is a middleware between application and database, so it uses
- tsearch2 as a storage, while database engine is used as a query executor
- (searching). Everything else (parsing of documents, query processing,
- linguistics) carry outs on client side. That's why OpenFTS has its own
- configuration table (fts_conf) and works with its own set of dictionaries.
- OpenFTS is more flexible, because it could be used in multi-server
- architecture with separated machines for repository of documents
- (documents could be stored in file system), database and query engine.
+ [3]OpenFTS is a middleware between application and database. OpenFTS
+ uses tsearch2 as a storage and database engine as a query executor
+ (searching). Everything else, i.e. parsing of documents, query
+ processing, linguistics, is carried out on the client side. That's why OpenFTS
+ has its own configuration table (fts_conf) and works with its own set
+ of dictionaries. OpenFTS is more flexible, because it could be used in
+ multi-server architecture with separate machines for repository of
+ documents (documents could be stored in filesystem), database and
+ query engine.
+
+ See [4]Documentation Roadmap for links to documentation.
Authors
* Oleg Bartunov <oleg@sai.msu.su>, Moscow, Moscow University, Russia
- * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Delta-Soft Ltd.,Russia
-
+ * Teodor Sigaev <teodor@sigaev.ru>, Moscow, Moscow University, Russia
+
Contributors
- * Robert John Shepherd and Andrew J. Kopciuch submitted
- "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
+ * Robert John Shepherd and Andrew J. Kopciuch submitted
+ "Introduction to tsearch" (Robert - tsearch v1, Andrew - tsearch
v2)
- * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
+ * Brandon Craig Rhodes wrote "Tsearch2 Guide" and "Tsearch2
Reference" and proposed new naming convention for tsearch V2
-
-Features Added with Tsearch2
- * Relevance ranking of search results
- * Table driven configuration
- * Morphology support (ispell dictionaries, snowball stemmers)
- * Headline support (text fragments with highlighted search terms)
- * Ability to plug-in custom dictionaries and parsers
- * Synonym dictionary
- * Generator of templates for dictionaries (built-in snowball stemmer
- support)
- * Statistics of indexed words is available
-
+Sponsors
+
+ * ABC Startsiden - compound words support
+ * University of Mannheim for UTF-8 support (in 8.2)
+ * jfg:networks ([5]http:www.jfg-networks.com/) for Gin - Generalized
+ Inverted index (in 8.2)
+ * Georgia Public Library Service and LibLime, Inc. for Thesaurus
+ dictionary
+ * PostGIS community - GiST Concurrency and Recovery
+
+ The authors are grateful to the Russian Foundation for Basic Research
+ and Delta-Soft Ltd., Moscow, Russia for support.
+
Limitations
- * Lexeme should be not longer than 2048 bytes
- * The number of lexemes is limited by 2^32. Note, that actual
- capacity of tsvector is depends on whether positional information
- is stored or not.
- * tsvector - the size is limited by approximately 2^20 bytes.
- * tsquery - the number of entries (lexemes and operations) < 32768
- * Positional information
- + maximal position of lexeme < 2^14 (16384)
- + lexeme could have maximum 256 positions
-
+ * Length of lexeme < 2K
+ * Length of tsvector (lexemes + positions) < 1Mb
+ * The number of lexemes < 2^32
+ * 0< Positional information < 16383
+ * No more than 256 positions per lexeme
+ * The number of nodes ( lexemes + operations) in tsquery < 32768
+
References
* GiST development site -
- [12]http://www.sai.msu.su/~megera/postgres/gist
- * OpenFTS home page - [13]http://openfts.sourceforge.net/
+ [6]http://www.sai.msu.su/~megera/postgres/gist
+ * GiN development - [7]http://www.sigaev.ru/gin/
+ * OpenFTS home page - [8]http://openfts.sourceforge.net/
* Mailing list -
- [14]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gen
- eral
-
- [15]Documentation Roadmap
-
+ [9]http://sourceforge.net/mailarchive/forum.php?forum=openfts-gene
+ ral
+
Documentation Roadmap
* Several docs are available from docs/ subdirectory
@@ -97,113 +108,103 @@ Documentation Roadmap
+ "Tsearch2 Guide" by Brandon Rhodes
+ "Tsearch2 Reference" by Brandon Rhodes
* Readme.gendict in gendict/ subdirectory
- + [16][Gendict tutorial]
-
- Online version of documentation is always available from Tsearch V2
- home page -
- [17]http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
-
+ + Also, check [10]Gendict tutorial
+ * Check [11]tsearch2 Wiki pages for various documentation
+
Support
- Authors urgently recommend people to use [18][openfts-general] or
- [19][pgsql-general] mailing lists for questions and discussions.
-
-Caution
+ Authors urgently recommend people to use [12]openfts-general or
+ [13]pgsql-general mailing lists for questions and discussions.
- In spite of apparent easy full text searching with our tsearch module
- (authors hope it's so), any serious search engine require profound
- study of various aspects, such as stop words, dictionaries, special
- parsers. Tsearch module was designed to facilitate both those cases.
-
Development History
+ Latest news
+
+ To the PostgreSQL 8.2 release we added:
+ * multibyte (UTF-8) support
+ * Thesaurus dictionary
+ * Query rewriting
+ * rank_cd relevance function now supports different weights of
+ lexemes
+ * GiN support adds scalability of tsearch2
+
Pre-tsearch era
- Development of OpenFTS began in 2000 after realizing that we
- needed a search engine optimized for online updates and able to
- access metadata from the database. This is essential for online
+ Development of OpenFTS began in 2000 after realizing that we
+ needed a search engine optimized for online updates with access
+ to metadata from the database. This is essential for online
news agencies, web portals, digital libraries, etc. Most search
- engines available utilize an inverted index which is very fast
- for searching but very slow for online updates. Incremental
- updates of an inverted index is a complex engineering task
- while we needed something light, free and with the ability to
- access metadata from the database. The last requirement is very
- important because in a real life application a search engine
- should always consult metadata ( topic, permissions, date
- range, version, etc.). We extensively use PostgreSQL as a
- database backend and have no intention to move from it, so the
- problem was to find a data structure and a fast way to access
- it. PostgreSQL has rather unique data type for storing sets
- (think about words) - arrays, but lacks index access to them. A
- document is parsed into lexemes, which are identified in
- various ways (e.g. stemming, morphology, dictionary), and as a
- result is reduced to an array of integer numbers. During our
- research we found a paper of Joseph Hellerstein which
- introduced an interesting data structure suitable for sets -
- RD-tree (Russian Doll tree). It looked very attractive, but
- implementing it in PostgreSQL seemed difficult because of our
- ignorance of database internals. Further research lead us to
- the idea to use GiST for implementing RD-tree, but at that time
- the GiST code had for a long while remained untouched and
- contained several bugs. After work on improving GiST for
- version 7.0.3 of PostgreSQL was done, we were able to implement
- RD-Tree and use it for index access to arrays of integers. This
- implementation was ideally suited for small arrays and
- eliminated complex joins, but was practically useless for
- indexing large arrays. The next improvement came from an idea
- to represent a document by a single bit-signature, a so-called
- superimposed signature (see "Index Structures for Databases
- Containing Data Items with Set-valued Attributes", 1997, Sven
- Helmer for details). We developeded the contrib/intarray module
- and used it for full text indexing.
-
+ engines available utilize an inverted index which is very fast
+ for searching but very slow for online updates. Incremental
+ updates of an inverted index is a complex engineering task
+ while we needed something light, free and with the ability to
+ access metadata from the database. The last requirement was
+ very important because in a real life application search engine
+ should always consult metadata ( topic, permissions, date
+ range, version, etc.). We extensively use PostgreSQL as a
+ database backend and have no intention to move from it, so the
+ problem was to find a data structure and a fast way to access
+ it. PostgreSQL has rather unique data type for storing sets
+ (think about words) - arrays, but lacks index access to them.
+ During our research we found a paper by Joseph Hellerstein, who
+ introduced an interesting data structure suitable for sets -
+ RD-tree (Russian Doll tree). Further research led us to the
+ idea to use GiST for implementing RD-tree, but at that time the
+ GiST code had been untouched for a long time and contained several
+ bugs. After work on improving GiST for version 7.0.3 of
+ PostgreSQL was done, we were able to implement RD-Tree and use
+ it for index access to arrays of integers. This implementation
+ was ideally suited for small arrays and eliminated complex
+ joins, but was practically useless for indexing large arrays.
+ The next improvement came from an idea to represent a document
+ by a single bit-signature, a so-called superimposed signature
+ (see "Index Structures for Databases Containing Data Items with
+ Set-valued Attributes", 1997, Sven Helmer for details). We
+ developed the contrib/intarray module and used it for full
+ text indexing.
+
tsearch v1
It was inconvenient to use integer id's instead of words, so we
- introduced a new data type called 'txtidx' - a searchable data
- type (textual) with indexed access. This was a first step of
- our work on an implementation of a built-in PostgreSQL full
+ introduced a new data type called 'txtidx' - a searchable data
+ type (textual) with indexed access. This was a first step of
+ our work on an implementation of a built-in PostgreSQL full
text search engine. Even though tsearch v1 had many features of
- a search engine it lacked configuration support and relevance
- ranking. People were encouraged to use OpenFTS, which provided
- relevance ranking based on coordinate information and flexible
- configuration. OpenFTS v.0.34 is the last version based on
+ a search engine it lacked configuration support and relevance
+ ranking. People were encouraged to use OpenFTS, which provided
+ relevance ranking based on positional information and flexible
+ configuration. OpenFTS v.0.34 is the last version based on
tsearch v1.
-
+
tsearch V2
- People recognized tsearch as a powerful tool for full text
- searching and insisted on adding ranking support, better
- configurability, etc. We already thought about moving most of
- the features of OpenFTS to tsearch, and in the early 2003 we
- decided to work on a new version of tsearch - tsearch v2. We've
- abandoned auxiliary index tables which were used by OpenFTS to
- store coordinate information and modified the txtidx type to
- store them internally. Also, we've added table-driven
- configuration, support of ispell dictionaries, snowball
- stemmers and the ability to specify which types of lexemes to
- index. Also, it's now possible to generate headlines of
- documents with highlighted search terms. These changes make
- tsearch more user friendly and turn it into a really powerful
- full text search engine. After announcing the alpha version, we
- received a proposal from Brandon Rhodes to rename tsearch
- functions to be more consistent. So, we have renamed txtidx
- type to tsvector and other things as well.
-
- To allow users of tsearch v1 smooth upgrade, we named the module as
- tsearch2.
-
- Future release of OpenFTS (v.0.35) will be based on tsearch2. Brave
- people could download it from OpenFTS CVS (see link from [20][OpenFTS
- page]
+ People recognized tsearch as a powerful tool for full text
+ searching and insisted on adding ranking support, better
+ configurability, etc. We already thought about moving most of
+ the features of OpenFTS to tsearch, and in early 2003 we
+ decided to work on a new version of tsearch. We abandoned
+ auxiliary index tables which were used by OpenFTS to store
+ positional information and modified the txtidx type to store
+ them internally. We added table-driven configuration, support
+ of ispell dictionaries, snowball stemmers and the ability to
+ specify which types of lexemes to index. Now, it's possible to
+ generate headlines of documents with highlighted search terms.
+ These changes make tsearch more user friendly and turn it into
+ a really powerful full text search engine. Brandon Rhodes
+ proposed to rename tsearch functions for consistency and we
+ renamed txtidx type to tsvector and other things as well. To
+ allow users of tsearch v1 smooth upgrade, we named the module
+ as tsearch2. Since version 0.35 OpenFTS uses tsearch2.
References
- 10. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
- 11. http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Readme#Documentation_Roadmap
- 12. http://www.sai.msu.su/~megera/postgres/gist
- 13. http://openfts.sourceforge.net/
- 14. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
- 15. http://www.sai.msu.su/~megera/oddmuse/index.cgi?action=anchor&id=Documentation_Roadmap#Documentation_Roadmap
- 16. http://www.sai.msu.su/~megera/oddmuse/index.cgi?Gendict
- 17. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/
- 18. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
- 19. http://archives.postgresql.org/pgsql-general/
- 20. http://openfts.sourceforge.net/
+ 1. http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme.html
+ 2. http://snowball.tartarus.org/
+ 3. http://openfts.sourceforge.net/
+ 4. file://localhost/u/megera/WWW/postgres/gist/tsearch/V2/docs/Tsearch_V2_Readme82.html#dm
+ 5. http:www.jfg-networks.com/
+ 6. http://www.sai.msu.su/~megera/postgres/gist
+ 7. http://www.sigaev.ru/gin/
+ 8. http://openfts.sourceforge.net/
+ 9. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
+ 10. http://www.sai.msu.su/~megera/wiki/Gendict
+ 11. http://www.sai.msu.su/~megera/wiki/Tsearch2
+ 12. http://sourceforge.net/mailarchive/forum.php?forum=openfts-general
+ 13. http://archives.postgresql.org/pgsql-general/
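
Note on the README text added above: it describes tsvector as a set of
unique words with their positions in the document, and tsquery as lexemes
joined by boolean operators. The following is only a toy Python model of
that idea, not the tsearch2 implementation. The stop-word list, the
whitespace "parser", the lack of stemming, the function names and the
choice to let stop words still consume a position are all assumptions made
for illustration. The 256-positions cap is taken from the Limitations list.

```python
from collections import defaultdict

# Taken from the README's Limitations list.
MAX_POSITIONS_PER_LEXEME = 256

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "the", "of", "and", "is"}


def to_toy_tsvector(text):
    """Map each unique lowercased word to the sorted list of its 1-based positions."""
    vector = defaultdict(list)
    for position, word in enumerate(text.lower().split(), start=1):
        if word in STOP_WORDS:
            # Assumption: a stop word is not indexed but still takes up a position.
            continue
        if len(vector[word]) < MAX_POSITIONS_PER_LEXEME:
            vector[word].append(position)
    return dict(vector)


def matches_all(vector, *lexemes):
    """Toy stand-in for a query whose lexemes are all joined by boolean AND."""
    return all(lexeme in vector for lexeme in lexemes)
```

For example, to_toy_tsvector("The cat and the fat cat") keeps only "cat"
and "fat", each with the positions where it occurred, and
matches_all(vector, "cat", "fat") then plays the role of the "contains"
operator for an AND query.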