path: root/src/backend/utils/adt/pg_lzcompress.c
commit    b027ad9a7a95fb14b62b73b507d1a8c7d6f5e940 (patch)
author    Jan Wieck <JanWieck@Yahoo.com>  2000-07-06 21:02:07 +0000
committer Jan Wieck <JanWieck@Yahoo.com>  2000-07-06 21:02:07 +0000
tree      69698d27f6e1e31a367bf3acd21ffacd8aee8d18  /src/backend/utils/adt/pg_lzcompress.c
parent    43f6ab86546931eee3cb2735953820efdb95a2f7 (diff)
Added comments about the compression algorithm as requested by Tom
Jan
Diffstat (limited to 'src/backend/utils/adt/pg_lzcompress.c')
-rw-r--r--  src/backend/utils/adt/pg_lzcompress.c  62
1 file changed, 58 insertions(+), 4 deletions(-)
diff --git a/src/backend/utils/adt/pg_lzcompress.c b/src/backend/utils/adt/pg_lzcompress.c
index 493b4e64465..0c258c0aae3 100644
--- a/src/backend/utils/adt/pg_lzcompress.c
+++ b/src/backend/utils/adt/pg_lzcompress.c
@@ -1,7 +1,7 @@
/* ----------
* pg_lzcompress.c -
*
- * $Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.6 2000/07/03 23:09:52 wieck Exp $
+ * $Header: /cvsroot/pgsql/src/backend/utils/adt/pg_lzcompress.c,v 1.7 2000/07/06 21:02:07 wieck Exp $
*
* This is an implementation of LZ compression for PostgreSQL.
* It uses a simple history table and generates 2-3 byte tags
@@ -46,7 +46,7 @@
* The return value is the size of bytes written to buff.
* Obviously the same as PGLZ_RAW_SIZE() returns.
*
- * The compression algorithm and internal data format:
+ * The decompression algorithm and internal data format:
*
* PGLZ_Header is defined as
*
@@ -57,8 +57,8 @@
*
* The header is followed by the compressed data itself.
*
- * The algorithm is easiest explained by describing the process
- * of decompression.
+ * The data representation is easiest explained by describing
+ * the process of decompression.
*
* If varsize == rawsize + sizeof(PGLZ_Header), then the data
* is stored uncompressed as plain bytes. Thus, the decompressor
@@ -108,6 +108,60 @@
* and end up with a total compression rate of 96%, what's still
* worth a Whow.
*
+ * The compression algorithm
+ *
+ * The following uses the numbers of the default strategy.
+ *
+ * The compressor works best for attributes of a size between
+ * 1K and 1M. For smaller items there's not that much chance of
+ * redundancy in the character sequence (except for large areas
+ * of identical bytes like trailing spaces), and for bigger ones
+ * the allocation of the history table is expensive (it needs
+ * 8 times the size of the input!).
+ *
+ * The compressor creates a table for 8192 lists of positions.
+ * For each input position (except the last 3), a hash key is
+ * built from the 4 next input bytes and the position is
+ * remembered in the appropriate list. Thus, the table points
+ * to linked lists of positions whose strings are likely to
+ * match in at least the first 4 characters. This is done on
+ * the fly while the input is compressed into the output area.
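
    For illustration, here is a minimal sketch of such a history table in
    C. All names (HIST_SIZE, HistEntry, hist_hash, hist_add) and the exact
    hash mix are made up for this sketch; the real code does the same job
    with macros in pg_lzcompress.c:

        #define HIST_SIZE 8192                      /* number of position lists */

        typedef struct HistEntry
        {
            struct HistEntry    *next;              /* older position, same key */
            const unsigned char *pos;               /* position in the input */
        } HistEntry;

        static HistEntry *hist_start[HIST_SIZE];    /* one list head per key */

        /* Build a hash key from the 4 bytes starting at p (simplified mix). */
        static int
        hist_hash(const unsigned char *p)
        {
            return ((p[0] << 9) ^ (p[1] << 6) ^ (p[2] << 3) ^ p[3])
                   & (HIST_SIZE - 1);
        }

        /* Remember position p by prepending it to its key's list, so the
         * list is visited in increasing backward-offset order later. */
        static void
        hist_add(const unsigned char *p, HistEntry *entry)
        {
            int key = hist_hash(p);

            entry->pos = p;
            entry->next = hist_start[key];
            hist_start[key] = entry;
        }
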
+ *
+ * For each byte in the input, its hash key (built from this
+ * byte and the next 3) is used to find the appropriate list
+ * in the table. The lists remember the positions of all bytes
+ * that had the same hash key in the past, in increasing
+ * backward offset order. Now, for all entries in that list,
+ * the match length is computed by comparing the characters
+ * at the entry's position with the characters at the current
+ * input position.
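
    That comparison could look like this sketch (hypothetical names again;
    the real code additionally caps the length at what a tag can encode):

        /* Count how many bytes match between the current input position
         * and an earlier history position (hist < input, so both reads
         * stay within the input buffer). */
        static int
        match_length(const unsigned char *input, const unsigned char *input_end,
                     const unsigned char *hist, int max_len)
        {
            int len = 0;

            while (input + len < input_end && len < max_len &&
                   input[len] == hist[len])
                len++;
            return len;
        }
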
+ *
+ * The compressor starts with a so-called "good_match" of 128.
+ * It is a "prefer speed over compression ratio" optimizer.
+ * So if the first entry looked at already has 128 or more
+ * matching characters, the lookup stops and that position is
+ * used for the next tag in the output.
+ *
+ * For each subsequent entry in the history list, the
+ * "good_match" is lowered by 10%. So the compressor will be
+ * happier with shorter matches the farther it has to go back
+ * in the history. This is another "speed over ratio"
+ * preference characteristic of the algorithm.
+ *
+ * Thus there are 3 stop conditions for the lookup of matches
+ * (all three appear in the sketch below):
+ *
+ * - a match >= good_match is found
+ * - there are no more history entries to look at
+ * - the next history entry is already too far back
+ * to be coded into a tag.
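
    A sketch of the whole lookup, with all three stop conditions written
    out. MAX_OFFSET stands for the largest backward distance a tag can
    encode and is an assumed name here:

        #define MAX_OFFSET 4095         /* assumed limit of a tag's offset */

        /* Search the history for the best match at 'input' (the caller
         * guarantees at least 4 bytes remain, so the hash key exists).
         * Returns the match length, 0 if none, and the offset in *off. */
        static int
        find_match(const unsigned char *input, const unsigned char *input_end,
                   int *off)
        {
            HistEntry  *e = hist_start[hist_hash(input)];
            int         good_match = 128;   /* initial "good enough" length */
            int         best_len = 0;
            int         best_off = 0;

            while (e != NULL)               /* stop 2: list exhausted */
            {
                int     this_off = (int) (input - e->pos);
                int     this_len;

                if (this_off > MAX_OFFSET)  /* stop 3: too far back for a tag */
                    break;

                this_len = match_length(input, input_end, e->pos,
                                        (int) (input_end - input));
                if (this_len > best_len)
                {
                    best_len = this_len;
                    best_off = this_off;
                }
                if (this_len >= good_match) /* stop 1: good enough, take it */
                    break;

                good_match -= good_match / 10;  /* demand 10% less next time */
                e = e->next;
            }

            *off = best_off;
            return best_len;
        }
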
+ *
+ * Finally, the match algorithm checks that a match of at
+ * least 3 bytes has been found, because that's the smallest
+ * amount of copy information to code into a tag. If so, a
+ * tag is emitted and all the input bytes covered by it are
+ * just scanned for the history adds; otherwise a literal
+ * character is emitted and only its history entry is added.
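
    Tying the sketches together, a hypothetical main loop could read as
    follows. emit_tag(), emit_literal() and next_entry() stand in for the
    real output macros and entry allocator and are not shown; end-of-buffer
    handling is simplified:

        static void emit_tag(int off, int len);     /* writes a 2-3 byte tag */
        static void emit_literal(unsigned char c);  /* writes a plain byte */
        static HistEntry *next_entry(void);         /* hands out list entries */

        static void
        compress_sketch(const unsigned char *dp, const unsigned char *dend)
        {
            while (dp < dend - 3)           /* the last 3 bytes have no key */
            {
                int     off;
                int     len = find_match(dp, dend, &off);

                if (len >= 3)               /* smallest copy a tag can code */
                {
                    emit_tag(off, len);
                    while (len-- > 0)       /* covered bytes only feed the */
                        hist_add(dp++, next_entry());   /* history table */
                }
                else
                {
                    emit_literal(*dp);      /* single literal byte */
                    hist_add(dp++, next_entry());
                }
            }
            /* the remaining (up to 3) trailing bytes become literals */
        }
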
+ *
* Acknowledgements:
*
* Many thanks to Adisak Pochanayon, who's article about SLZ