author | Tom Lane <tgl@sss.pgh.pa.us> | 2011-02-17 19:01:01 -0500 |
---|---|---|
committer | Tom Lane <tgl@sss.pgh.pa.us> | 2011-02-17 19:01:01 -0500 |
commit | 2b3a0630b54ff9970a7cd2c78a686015f9a53c0c (patch) | |
tree | 5dfcc711292be33f0a8097cfa848f5c86ef22131 /src | |
parent | 42e663cc4139bba218efccfb53293cd6e6fa43da (diff) | |
Fix tsmatchsel() to account properly for null rows.
ts_typanalyze.c computes MCE statistics as fractions of the non-null rows,
which seems fairly reasonable, and anyway changing it in released versions
wouldn't be a good idea. But then ts_selfuncs.c has to account for that.
Failure to do so results in overestimates in columns with a significant
fraction of null documents. Back-patch to 8.4 where this stuff was
introduced.
Jesper Krogh
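
The correction described above can be sketched in isolation. This is a minimal illustration, not the PostgreSQL code: `scale_for_nulls` is a hypothetical helper name, and the numbers in the note that follows are made up. It assumes a selectivity expressed as a fraction of non-null rows and rescales it to a fraction of all rows.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of PostgreSQL): convert a selectivity
 * measured against non-null rows into one measured against all rows.
 * stanullfrac is the fraction of rows that are NULL, in [0, 1].
 */
static double
scale_for_nulls(double selec_nonnull, double stanullfrac)
{
    assert(stanullfrac >= 0.0 && stanullfrac <= 1.0);

    /* NULL rows can never match, so only the non-null share contributes. */
    return selec_nonnull * (1.0 - stanullfrac);
}
```

For example, if a lexeme appears in 20% of the non-null documents and half of the rows are NULL, `scale_for_nulls(0.20, 0.50)` gives 0.10, so the uncorrected estimate would be too high by a factor of two.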
Diffstat (limited to 'src')
-rw-r--r-- | src/backend/tsearch/ts_selfuncs.c | 6 |
-rw-r--r-- | src/include/catalog/pg_statistic.h | 2 |
2 files changed, 8 insertions, 0 deletions
diff --git a/src/backend/tsearch/ts_selfuncs.c b/src/backend/tsearch/ts_selfuncs.c
index 1f0a42d9b12..709d48c6178 100644
--- a/src/backend/tsearch/ts_selfuncs.c
+++ b/src/backend/tsearch/ts_selfuncs.c
@@ -188,11 +188,17 @@ tsquerysel(VariableStatData *vardata, Datum constval)
 			/* No most-common-elements info, so do without */
 			selec = tsquery_opr_selec_no_stats(query);
 		}
+
+		/*
+		 * MCE stats count only non-null rows, so adjust for null rows.
+		 */
+		selec *= (1.0 - stats->stanullfrac);
 	}
 	else
 	{
 		/* No stats at all, so do without */
 		selec = tsquery_opr_selec_no_stats(query);
+		/* we assume no nulls here, so no stanullfrac correction */
 	}
 
 	return selec;
diff --git a/src/include/catalog/pg_statistic.h b/src/include/catalog/pg_statistic.h
index 0e831ef2982..edccf254b1b 100644
--- a/src/include/catalog/pg_statistic.h
+++ b/src/include/catalog/pg_statistic.h
@@ -244,6 +244,8 @@ typedef FormData_pg_statistic *Form_pg_statistic;
  * type with identifiable elements (for instance, tsvector). staop contains
  * the equality operator appropriate to the element type. stavalues contains
  * the most common element values, and stanumbers their frequencies. Unlike
+ * MCV slots, frequencies are measured as the fraction of non-null rows the
+ * element value appears in, not the frequency of all rows. Also unlike
  * MCV slots, the values are sorted into order (to support binary search
  * for a particular value). Since this puts the minimum and maximum
  * frequencies at unpredictable spots in stanumbers, there are two extra