Detect redundant GROUP BY columns using UNIQUE indexes

d4c3a156c added support that when the GROUP BY contained all of the columns belonging to a relation's PRIMARY KEY, all other columns belonging to that relation would be removed from the GROUP BY clause. That's possible because all other columns are functionally dependent on the PRIMARY KEY and those columns alone ensure the groups are distinct. Here we expand on that optimization and allow it to work for any unique indexes on the table rather than just the PRIMARY KEY index. This normally requires that all columns in the index are defined with NOT NULL, however, we can relax that requirement when the index is defined with NULLS NOT DISTINCT. When there are multiple suitable indexes to allow columns to be removed, we prefer the index with the least number of columns as this allows us to remove the highest number of GROUP BY columns. One day, we may want to revisit that decision as it may make more sense to use the narrower set of columns in terms of the width of the data types and stored/queried data. This also adjusts the code to make use of RelOptInfo.indexlist rather than looking up the catalog tables. In passing, add another short-circuit path to allow bailing out earlier in cases where it's certainly not possible to remove redundant GROUP BY columns. This early exit is now cheaper to do than when this code was originally written as 00b41463c made it cheaper to check for empty Bitmapsets. Patch originally by Zhang Mingli and later worked on by jian he, but after I (David) worked on it, there was very little of the original left. Author: Zhang Mingli, jian he, David Rowley Reviewed-by: jian he, Andrei Lepikhov Discussion: https://postgr.es/m/327990c8-b9b2-4b0c-bffb-462249f82de0%40Spark
author: David Rowley <drowley@postgresql.org> 2024-12-12 15:28:38 +1300
committer: David Rowley <drowley@postgresql.org> 2024-12-12 15:28:38 +1300
commit: bd10ec529796a13670645e6acd640c6f290df020 (patch)
tree: c358ed55e836e2ff46d508f25776ff9c1175af64 /src/backend/optimizer/plan
parent: d8f335156c57f0df6ae3b1ec31e55979838eb882 (diff)
download: postgresql-bd10ec529796a13670645e6acd640c6f290df020.tar.gz
postgresql-bd10ec529796a13670645e6acd640c6f290df020.zip
1 files changed, 98 insertions, 22 deletions
diff --git a/src/backend/optimizer/plan/initsplan.c b/src/backend/optimizer/plan/initsplan.c
index c1c4f9f8b9d..5f3908be519 100644
--- a/src/backend/optimizer/plan/initsplan.c
+++ b/src/backend/optimizer/plan/initsplan.c
@@ -400,17 +400,13 @@ add_vars_to_attr_needed(PlannerInfo *root, List *vars,
  *
  * Since some other DBMSes do not allow references to ungrouped columns, it's
  * not unusual to find all columns listed in GROUP BY even though listing the
- * primary-key columns would be sufficient.  Deleting such excess columns
- * avoids redundant sorting work, so it's worth doing.
+ * primary-key columns, or columns of a unique constraint would be sufficient.
+ * Deleting such excess columns avoids redundant sorting or hashing work, so
+ * it's worth doing.
  *
  * Relcache invalidations will ensure that cached plans become invalidated
- * when the underlying index of the pkey constraint is dropped.
- *
- * Currently, we only make use of pkey constraints for this, however, we may
- * wish to take this further in the future and also use unique constraints
- * which have NOT NULL columns.  In that case, plan invalidation will still
- * work since relations will receive a relcache invalidation when a NOT NULL
- * constraint is dropped.
+ * when the underlying supporting indexes are dropped or if a column's NOT
+ * NULL attribute is removed.
  */
 void
 remove_useless_groupby_columns(PlannerInfo *root)
@@ -418,6 +414,7 @@ remove_useless_groupby_columns(PlannerInfo *root)
 	Query	   *parse = root->parse;
 	Bitmapset **groupbyattnos;
 	Bitmapset **surplusvars;
+	bool		tryremove = false;
 	ListCell   *lc;
 	int			relid;
 
@@ -457,11 +454,25 @@ remove_useless_groupby_columns(PlannerInfo *root)
 		/* OK, remember we have this Var */
 		relid = var->varno;
 		Assert(relid <= list_length(parse->rtable));
+
+		/*
+		 * If this isn't the first column for this relation then we now have
+		 * multiple columns.  That means there might be some that can be
+		 * removed.
+		 */
+		tryremove |= !bms_is_empty(groupbyattnos[relid]);
 		groupbyattnos[relid] = bms_add_member(groupbyattnos[relid],
 											  var->varattno - FirstLowInvalidHeapAttributeNumber);
 	}
 
 	/*
+	 * No Vars or didn't find multiple Vars for any relation in the GROUP BY?
+	 * If so, nothing can be removed, so don't waste more effort trying.
+	 */
+	if (!tryremove)
+		return;
+
+	/*
 	 * Consider each relation and see if it is possible to remove some of its
 	 * Vars from GROUP BY.  For simplicity and speed, we do the actual removal
 	 * in a separate pass.  Here, we just fill surplusvars[k] with a bitmapset
@@ -472,9 +483,10 @@ remove_useless_groupby_columns(PlannerInfo *root)
 	foreach(lc, parse->rtable)
 	{
 		RangeTblEntry *rte = lfirst_node(RangeTblEntry, lc);
+		RelOptInfo *rel;
 		Bitmapset  *relattnos;
-		Bitmapset  *pkattnos;
-		Oid			constraintOid;
+		Bitmapset  *best_keycolumns = NULL;
+		int32		best_nkeycolumns = PG_INT32_MAX;
 
 		relid++;
 
@@ -495,19 +507,83 @@ remove_useless_groupby_columns(PlannerInfo *root)
 		if (bms_membership(relattnos) != BMS_MULTIPLE)
 			continue;
 
-		/*
-		 * Can't remove any columns for this rel if there is no suitable
-		 * (i.e., nondeferrable) primary key constraint.
-		 */
-		pkattnos = get_primary_key_attnos(rte->relid, false, &constraintOid);
-		if (pkattnos == NULL)
-			continue;
+		rel = root->simple_rel_array[relid];
 
 		/*
-		 * If the primary key is a proper subset of relattnos then we have
-		 * some items in the GROUP BY that can be removed.
+		 * Now check each index for this relation to see if there are any with
+		 * columns which are a proper subset of the grouping columns for this
+		 * relation.
 		 */
-		if (bms_subset_compare(pkattnos, relattnos) == BMS_SUBSET1)
+		foreach_node(IndexOptInfo, index, rel->indexlist)
+		{
+			Bitmapset  *ind_attnos;
+			bool		nulls_check_ok;
+
+			/*
+			 * Skip any non-unique and deferrable indexes.  Predicate indexes
+			 * have not been checked yet, so we must skip those too as the
+			 * predOK check that's done later might fail.
+			 */
+			if (!index->unique || !index->immediate || index->indpred != NIL)
+				continue;
+
+			/* For simplicity, we currently don't support expression indexes */
+			if (index->indexprs != NIL)
+				continue;
+
+			ind_attnos = NULL;
+			nulls_check_ok = true;
+			for (int i = 0; i < index->nkeycolumns; i++)
+			{
+				/*
+				 * We must insist that the index columns are all defined NOT
+				 * NULL otherwise duplicate NULLs could exist.  However, we
+				 * can relax this check when the index is defined with NULLS
+				 * NOT DISTINCT as there can only be 1 NULL row, therefore
+				 * functional dependency on the unique columns is maintained,
+				 * despite the NULL.
+				 */
+				if (!index->nullsnotdistinct &&
+					!bms_is_member(index->indexkeys[i],
+								   rel->notnullattnums))
+				{
+					nulls_check_ok = false;
+					break;
+				}
+
+				ind_attnos =
+					bms_add_member(ind_attnos,
+								   index->indexkeys[i] -
+								   FirstLowInvalidHeapAttributeNumber);
+			}
+
+			if (!nulls_check_ok)
+				continue;
+
+			/*
+			 * Skip any indexes where the indexed columns aren't a proper
+			 * subset of the GROUP BY.
+			 */
+			if (bms_subset_compare(ind_attnos, relattnos) != BMS_SUBSET1)
+				continue;
+
+			/*
+			 * Record the attribute numbers from the index with the fewest
+			 * columns.  This allows the largest number of columns to be
+			 * removed from the GROUP BY clause.  In the future, we may wish
+			 * to consider using the narrowest set of columns and looking at
+			 * pg_statistic.stawidth as it might be better to use an index
+			 * with, say two INT4s, rather than, say, one long varlena column.
+			 */
+			if (index->nkeycolumns < best_nkeycolumns)
+			{
+				best_keycolumns = ind_attnos;
+				best_nkeycolumns = index->nkeycolumns;
+			}
+		}
+
+		/* Did we find a suitable index? */
+		if (!bms_is_empty(best_keycolumns))
 		{
 			/*
 			 * To easily remember whether we've found anything to do, we don't
@@ -518,7 +594,7 @@ remove_useless_groupby_columns(PlannerInfo *root)
 													 (list_length(parse->rtable) + 1));
 
 			/* Remember the attnos of the removable columns */
-			surplusvars[relid] = bms_difference(relattnos, pkattnos);
+			surplusvars[relid] = bms_difference(relattnos, best_keycolumns);
 		}
 	}
author	David Rowley <drowley@postgresql.org>	2024-12-12 15:28:38 +1300
committer	David Rowley <drowley@postgresql.org>	2024-12-12 15:28:38 +1300
commit	bd10ec529796a13670645e6acd640c6f290df020 (patch)
tree	c358ed55e836e2ff46d508f25776ff9c1175af64 /src/backend/optimizer/plan
parent	d8f335156c57f0df6ae3b1ec31e55979838eb882 (diff)
download	postgresql-bd10ec529796a13670645e6acd640c6f290df020.tar.gz postgresql-bd10ec529796a13670645e6acd640c6f290df020.zip