author     Thomas Munro <tmunro@postgresql.org>  2019-12-24 11:31:24 +1300
committer  Thomas Munro <tmunro@postgresql.org>  2019-12-24 13:05:43 +1300
commit     e69d644547785cc9f079650d29118a3688bc5039
tree       d7762c8e93743b87a3786a372c925aab46277111 /src/backend/executor/nodeHash.c
parent     d5b9c2baff662aac22cd2a497d5bcd3b5a916fd0
Rotate instead of shifting hash join batch number.
Our algorithm for choosing batch numbers turned out not to work
effectively for multi-billion key inner relations. We would use
more hash bits than we have, and effectively concentrate all tuples
into a smaller number of batches than we intended. While ideally
we should switch to wider hashes, for now, change the algorithm to
one that effectively gives up bits from the bucket number when we
don't have enough bits. That means we'll finish up with longer
bucket chains than would be ideal, but that's better than having
batches that don't fit in work_mem and can't be divided.
Back-patch to all supported releases.
Author: Thomas Munro
Reviewed-by: Tom Lane, thanks also to Tomas Vondra, Alvaro Herrera, Andres Freund for testing and discussion
Reported-by: James Coleman
Discussion: https://postgr.es/m/16104-dc11ed911f1ab9df%40postgresql.org
Diffstat (limited to 'src/backend/executor/nodeHash.c')
 src/backend/executor/nodeHash.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index d6f4eda0977..568938667f9 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -37,6 +37,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_bitutils.h"
 #include "utils/dynahash.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
@@ -1877,7 +1878,7 @@
  * chains), and must only cause the batch number to remain the same or
  * increase.  Our algorithm is
  *		bucketno = hashvalue MOD nbuckets
- *		batchno = (hashvalue DIV nbuckets) MOD nbatch
+ *		batchno = ROR(hashvalue, log2_nbuckets) MOD nbatch
  * where nbuckets and nbatch are both expected to be powers of 2, so we can
  * do the computations by shifting and masking.  (This assumes that all hash
  * functions are good about randomizing all their output bits, else we are
@@ -1889,7 +1890,11 @@
  * number the way we do here).
  *
  * nbatch is always a power of 2; we increase it only by doubling it.  This
- * effectively adds one more bit to the top of the batchno.
+ * effectively adds one more bit to the top of the batchno.  In very large
+ * joins, we might run out of bits to add, so we do this by rotating the hash
+ * value.  This causes batchno to steal bits from bucketno when the number of
+ * virtual buckets exceeds 2^32.  It's better to have longer bucket chains
+ * than to lose the ability to divide batches.
  */
 void
 ExecHashGetBucketAndBatch(HashJoinTable hashtable,
@@ -1902,9 +1907,9 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 
 	if (nbatch > 1)
 	{
-		/* we can do MOD by masking, DIV by shifting */
 		*bucketno = hashvalue & (nbuckets - 1);
-		*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
+		*batchno = pg_rotate_right32(hashvalue,
+									 hashtable->log2_nbuckets) & (nbatch - 1);
 	}
 	else
 	{