Fix misestimation of n_distinct for a nearly-unique column with many nulls.

If ANALYZE found no repeated non-null entries in its sample, it set the column's stadistinct value to -1.0, intending to indicate that the entries are all distinct. But what this value actually means is that the number of distinct values is 100% of the table's rowcount, and thus it was overestimating the number of distinct values by however many nulls there are. This could lead to very poor selectivity estimates, as for example in a recent report from Andreas Joseph Krogh. We should discount the stadistinct value by whatever we've estimated the nulls fraction to be. (That is what will happen if we choose to use a negative stadistinct for a column that does have repeated entries, so this code path was just inconsistent.) In addition to fixing the stadistinct entries stored by several different ANALYZE code paths, adjust the logic where get_variable_numdistinct() forces an "all distinct" estimate on the basis of finding a relevant unique index. Unique indexes don't reject nulls, so there's no reason to assume that the null fraction doesn't apply. Back-patch to all supported branches. Back-patching is a bit of a judgment call, but this problem seems to affect only a few users (else we'd have identified it long ago), and it's bad enough when it does happen that destabilizing plan choices in a worse direction seems unlikely. Patch by me, with documentation wording suggested by Dean Rasheed Report: <VisenaEmail.26.df42f82acae38a58.156463942b8@tc7-visena> Discussion: <16143.1470350371@sss.pgh.pa.us>
author: Tom Lane <tgl@sss.pgh.pa.us> 2016-08-07 18:52:02 -0400
committer: Tom Lane <tgl@sss.pgh.pa.us> 2016-08-07 18:52:02 -0400
commit: 95bee941be4c009ebbc29162a0dc9664f40de12f (patch)
tree: 97557258e66828f2075dbf6281d416c53cde4621 /src/backend/commands/analyze.c
parent: 8a8c6b53810026641a1e12f60f873a7bd3cea5e3 (diff)
download: postgresql-95bee941be4c009ebbc29162a0dc9664f40de12f.tar.gz
postgresql-95bee941be4c009ebbc29162a0dc9664f40de12f.zip
1 files changed, 11 insertions, 5 deletions
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 5fcedd78554..9ac71220a2a 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2049,8 +2049,11 @@ compute_distinct_stats(VacAttrStatsP stats,
 
 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (track_cnt < track_max && toowide_cnt == 0 &&
 				 nmultiple == track_cnt)
@@ -2426,8 +2429,11 @@ compute_scalar_stats(VacAttrStatsP stats,
 
 		if (nmultiple == 0)
 		{
-			/* If we found no repeated values, assume it's a unique column */
-			stats->stadistinct = -1.0;
+			/*
+			 * If we found no repeated non-null values, assume it's a unique
+			 * column; but be sure to discount for any nulls we found.
+			 */
+			stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 		}
 		else if (toowide_cnt == 0 && nmultiple == ndistinct)
 		{
@@ -2753,7 +2759,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		else
 			stats->stawidth = stats->attrtype->typlen;
 		/* Assume all too-wide values are distinct, so it's a unique column */
-		stats->stadistinct = -1.0;
+		stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);
 	}
 	else if (null_cnt > 0)
 	{
author	Tom Lane <tgl@sss.pgh.pa.us>	2016-08-07 18:52:02 -0400
committer	Tom Lane <tgl@sss.pgh.pa.us>	2016-08-07 18:52:02 -0400
commit	95bee941be4c009ebbc29162a0dc9664f40de12f (patch)
tree	97557258e66828f2075dbf6281d416c53cde4621 /src/backend/commands/analyze.c
parent	8a8c6b53810026641a1e12f60f873a7bd3cea5e3 (diff)
download	postgresql-95bee941be4c009ebbc29162a0dc9664f40de12f.tar.gz postgresql-95bee941be4c009ebbc29162a0dc9664f40de12f.zip