author | Thomas Munro <tmunro@postgresql.org> | 2023-03-23 12:39:43 +1300
committer | Thomas Munro <tmunro@postgresql.org> | 2023-03-23 13:14:25 +1300
commit | 8fba928fd78856712f69d96852f8061e77390fda (patch)
tree | bad0cda018a6e277f7b4d92180edf55ea3eac057 /src/backend/executor/nodeHashjoin.c
parent | 11470f544e3729c60fab890145b2e839cbc8905e (diff)
Improve the naming of Parallel Hash Join phases.
* Commit 3048898e dropped -ING from PHJ wait event names. Update the
corresponding barrier phase names to match.
* Rename the "DONE" phases to "FREE". That's symmetrical with
"ALLOCATE", and names the activity that actually happens in that phase
(as we do for the other phases) rather than a state. The bug fixed by
commit 8d578b9b might have been more obvious with this name.
* Rename the batch/bucket growth barriers' "ALLOCATE" phases to
"REALLOCATE", a better description of what they do.
* Update the high-level comments about phases to mark the phases that are
executed by a single process with an asterisk (mostly memory management
phases); see the sketch after this list.
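
These phase names are preprocessor constants in src/include/executor/hashjoin.h, a file this diffstat-limited view doesn't show. As a rough sketch of the renamed scheme (the numeric values and the modulo helpers are reconstructed from context, not taken from this page):

```c
/* Sketch of the renamed phase constants; see src/include/executor/hashjoin.h. */

/* build_barrier phases: consecutive integers, starting from zero */
#define PHJ_BUILD_ELECT         0
#define PHJ_BUILD_ALLOCATE      1   /* performed by one elected process */
#define PHJ_BUILD_HASH_INNER    2
#define PHJ_BUILD_HASH_OUTER    3
#define PHJ_BUILD_RUN           4
#define PHJ_BUILD_FREE          5   /* performed by one elected process */

/* per-batch barrier phases */
#define PHJ_BATCH_ELECT         0
#define PHJ_BATCH_ALLOCATE      1   /* performed by one elected process */
#define PHJ_BATCH_LOAD          2
#define PHJ_BATCH_PROBE         3
#define PHJ_BATCH_FREE          4   /* performed by one elected process */

/* the growth barriers cycle, so their phase is interpreted modulo the cycle */
#define PHJ_GROW_BATCHES_PHASE(n)   ((n) % 5)
#define PHJ_GROW_BUCKETS_PHASE(n)   ((n) % 3)
```

The single-process (asterisked) phases use the barrier election idiom: BarrierArriveAndWait() returns true in exactly one attached participant, and that participant performs the serial work, as in the PHJ_BATCH_ELECT case visible in the diff below.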
No behavior change, as this is just improving internal identifiers. The
only user-visible sign of this is that a couple of wait events' display
names change from "...Allocate" to "...Reallocate" in pg_stat_activity,
to stay in sync with the internal names.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKG%2BMDpwF2Eo2LAvzd%3DpOh81wUTsrwU1uAwR-v6OGBB6%2B7g%40mail.gmail.com
Diffstat (limited to 'src/backend/executor/nodeHashjoin.c')
-rw-r--r-- | src/backend/executor/nodeHashjoin.c | 90
1 file changed, 46 insertions, 44 deletions
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 32f12fefd7c..f189fb4d287 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -39,27 +39,30 @@
  *
  * One barrier called build_barrier is used to coordinate the hashing phases.
  * The phase is represented by an integer which begins at zero and increments
- * one by one, but in the code it is referred to by symbolic names as follows:
+ * one by one, but in the code it is referred to by symbolic names as follows.
+ * An asterisk indicates a phase that is performed by a single arbitrarily
+ * chosen process.
  *
- * PHJ_BUILD_ELECTING       -- initial state
- * PHJ_BUILD_ALLOCATING     -- one sets up the batches and table 0
- * PHJ_BUILD_HASHING_INNER  -- all hash the inner rel
- * PHJ_BUILD_HASHING_OUTER  -- (multi-batch only) all hash the outer
- * PHJ_BUILD_RUNNING        -- building done, probing can begin
- * PHJ_BUILD_DONE           -- all work complete, one frees batches
+ * PHJ_BUILD_ELECT          -- initial state
+ * PHJ_BUILD_ALLOCATE*      -- one sets up the batches and table 0
+ * PHJ_BUILD_HASH_INNER     -- all hash the inner rel
+ * PHJ_BUILD_HASH_OUTER     -- (multi-batch only) all hash the outer
+ * PHJ_BUILD_RUN            -- building done, probing can begin
+ * PHJ_BUILD_FREE*          -- all work complete, one frees batches
  *
- * While in the phase PHJ_BUILD_HASHING_INNER a separate pair of barriers may
+ * While in the phase PHJ_BUILD_HASH_INNER a separate pair of barriers may
  * be used repeatedly as required to coordinate expansions in the number of
  * batches or buckets. Their phases are as follows:
  *
- * PHJ_GROW_BATCHES_ELECTING        -- initial state
- * PHJ_GROW_BATCHES_ALLOCATING      -- one allocates new batches
- * PHJ_GROW_BATCHES_REPARTITIONING  -- all repartition
- * PHJ_GROW_BATCHES_FINISHING       -- one cleans up, detects skew
+ * PHJ_GROW_BATCHES_ELECT        -- initial state
+ * PHJ_GROW_BATCHES_REALLOCATE*  -- one allocates new batches
+ * PHJ_GROW_BATCHES_REPARTITION  -- all repartition
+ * PHJ_GROW_BATCHES_DECIDE*      -- one detects skew and cleans up
+ * PHJ_GROW_BATCHES_FINISH       -- finished one growth cycle
  *
- * PHJ_GROW_BUCKETS_ELECTING     -- initial state
- * PHJ_GROW_BUCKETS_ALLOCATING   -- one allocates new buckets
- * PHJ_GROW_BUCKETS_REINSERTING  -- all insert tuples
+ * PHJ_GROW_BUCKETS_ELECT        -- initial state
+ * PHJ_GROW_BUCKETS_REALLOCATE*  -- one allocates new buckets
+ * PHJ_GROW_BUCKETS_REINSERT     -- all insert tuples
  *
  * If the planner got the number of batches and buckets right, those won't be
  * necessary, but on the other hand we might finish up needing to expand the
@@ -67,27 +70,27 @@
  * within our memory budget and load factor target. For that reason it's a
  * separate pair of barriers using circular phases.
  *
- * The PHJ_BUILD_HASHING_OUTER phase is required only for multi-batch joins,
+ * The PHJ_BUILD_HASH_OUTER phase is required only for multi-batch joins,
  * because we need to divide the outer relation into batches up front in order
  * to be able to process batches entirely independently. In contrast, the
  * parallel-oblivious algorithm simply throws tuples 'forward' to 'later'
  * batches whenever it encounters them while scanning and probing, which it
  * can do because it processes batches in serial order.
  *
- * Once PHJ_BUILD_RUNNING is reached, backends then split up and process
+ * Once PHJ_BUILD_RUN is reached, backends then split up and process
  * different batches, or gang up and work together on probing batches if there
  * aren't enough to go around. For each batch there is a separate barrier
  * with the following phases:
  *
- * PHJ_BATCH_ELECTING    -- initial state
- * PHJ_BATCH_ALLOCATING  -- one allocates buckets
- * PHJ_BATCH_LOADING     -- all load the hash table from disk
- * PHJ_BATCH_PROBING     -- all probe
- * PHJ_BATCH_DONE        -- end
+ * PHJ_BATCH_ELECT      -- initial state
+ * PHJ_BATCH_ALLOCATE*  -- one allocates buckets
+ * PHJ_BATCH_LOAD       -- all load the hash table from disk
+ * PHJ_BATCH_PROBE      -- all probe
+ * PHJ_BATCH_FREE*      -- one frees memory
  *
  * Batch 0 is a special case, because it starts out in phase
- * PHJ_BATCH_PROBING; populating batch 0's hash table is done during
- * PHJ_BUILD_HASHING_INNER so we can skip loading.
+ * PHJ_BATCH_PROBE; populating batch 0's hash table is done during
+ * PHJ_BUILD_HASH_INNER so we can skip loading.
  *
  * Initially we try to plan for a single-batch hash join using the combined
  * hash_mem of all participants to create a large shared hash table. If that
@@ -99,8 +102,8 @@
  * finished. Practically, that means that we never emit a tuple while attached
  * to a barrier, unless the barrier has reached a phase that means that no
  * process will wait on it again. We emit tuples while attached to the build
- * barrier in phase PHJ_BUILD_RUNNING, and to a per-batch barrier in phase
- * PHJ_BATCH_PROBING. These are advanced to PHJ_BUILD_DONE and PHJ_BATCH_DONE
+ * barrier in phase PHJ_BUILD_RUN, and to a per-batch barrier in phase
+ * PHJ_BATCH_PROBE. These are advanced to PHJ_BUILD_FREE and PHJ_BATCH_FREE
  * respectively without waiting, using BarrierArriveAndDetach(). The last to
  * detach receives a different return value so that it knows that it's safe to
  * clean up. Any straggler process that attaches after that phase is reached
@@ -306,13 +309,12 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				if (parallel)
 				{
 					/*
-					 * Advance the build barrier to PHJ_BUILD_RUNNING
-					 * before proceeding so we can negotiate resource
-					 * cleanup.
+					 * Advance the build barrier to PHJ_BUILD_RUN before
+					 * proceeding so we can negotiate resource cleanup.
 					 */
 					Barrier    *build_barrier = &parallel_state->build_barrier;
 
-					while (BarrierPhase(build_barrier) < PHJ_BUILD_RUNNING)
+					while (BarrierPhase(build_barrier) < PHJ_BUILD_RUN)
 						BarrierArriveAndWait(build_barrier, 0);
 				}
 				return NULL;
@@ -336,10 +338,10 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 					Barrier    *build_barrier;
 
 					build_barrier = &parallel_state->build_barrier;
-					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER ||
-						   BarrierPhase(build_barrier) == PHJ_BUILD_RUNNING ||
-						   BarrierPhase(build_barrier) == PHJ_BUILD_DONE);
-					if (BarrierPhase(build_barrier) == PHJ_BUILD_HASHING_OUTER)
+					Assert(BarrierPhase(build_barrier) == PHJ_BUILD_HASH_OUTER ||
+						   BarrierPhase(build_barrier) == PHJ_BUILD_RUN ||
+						   BarrierPhase(build_barrier) == PHJ_BUILD_FREE);
+					if (BarrierPhase(build_barrier) == PHJ_BUILD_HASH_OUTER)
 					{
 						/*
 						 * If multi-batch, we need to hash the outer relation
@@ -350,7 +352,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 							BarrierArriveAndWait(build_barrier,
 												 WAIT_EVENT_HASH_BUILD_HASH_OUTER);
 					}
-					else if (BarrierPhase(build_barrier) == PHJ_BUILD_DONE)
+					else if (BarrierPhase(build_barrier) == PHJ_BUILD_FREE)
 					{
 						/*
 						 * If we attached so late that the job is finished and
@@ -361,7 +363,7 @@ ExecHashJoinImpl(PlanState *pstate, bool parallel)
 				}
 
 				/* Each backend should now select a batch to work on. */
-				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_RUNNING);
+				Assert(BarrierPhase(build_barrier) == PHJ_BUILD_RUN);
 				hashtable->curbatch = -1;
 				node->hj_JoinState = HJ_NEED_NEW_BATCH;
 
@@ -1153,7 +1155,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 
 			switch (BarrierAttach(batch_barrier))
 			{
-				case PHJ_BATCH_ELECTING:
+				case PHJ_BATCH_ELECT:
 
 					/* One backend allocates the hash table. */
 					if (BarrierArriveAndWait(batch_barrier,
@@ -1161,13 +1163,13 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 						ExecParallelHashTableAlloc(hashtable, batchno);
 					/* Fall through. */
 
-				case PHJ_BATCH_ALLOCATING:
+				case PHJ_BATCH_ALLOCATE:
 					/* Wait for allocation to complete. */
 					BarrierArriveAndWait(batch_barrier,
 										 WAIT_EVENT_HASH_BATCH_ALLOCATE);
 					/* Fall through. */
 
-				case PHJ_BATCH_LOADING:
+				case PHJ_BATCH_LOAD:
 					/* Start (or join in) loading tuples. */
 					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
 					inner_tuples = hashtable->batches[batchno].inner_tuples;
@@ -1187,7 +1189,7 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 										 WAIT_EVENT_HASH_BATCH_LOAD);
 					/* Fall through. */
 
-				case PHJ_BATCH_PROBING:
+				case PHJ_BATCH_PROBE:
 
 					/*
 					 * This batch is ready to probe. Return control to
@@ -1197,13 +1199,13 @@ ExecParallelHashJoinNewBatch(HashJoinState *hjstate)
 					 * this barrier again (or else a deadlock could occur).
 					 * All attached participants must eventually call
 					 * BarrierArriveAndDetach() so that the final phase
-					 * PHJ_BATCH_DONE can be reached.
+					 * PHJ_BATCH_FREE can be reached.
 					 */
 					ExecParallelHashTableSetCurrentBatch(hashtable, batchno);
 					sts_begin_parallel_scan(hashtable->batches[batchno].outer_tuples);
 					return true;
 
-				case PHJ_BATCH_DONE:
+				case PHJ_BATCH_FREE:
 
 					/*
 					 * Already done. Detach and go around again (if any
@@ -1523,7 +1525,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	/*
 	 * It would be possible to reuse the shared hash table in single-batch
 	 * cases by resetting and then fast-forwarding build_barrier to
-	 * PHJ_BUILD_DONE and batch 0's batch_barrier to PHJ_BATCH_PROBING, but
+	 * PHJ_BUILD_FREE and batch 0's batch_barrier to PHJ_BATCH_PROBE, but
 	 * currently shared hash tables are already freed by now (by the last
 	 * participant to detach from the batch). We could consider keeping it
 	 * around for single-batch joins. We'd also need to adjust
@@ -1542,7 +1544,7 @@ ExecHashJoinReInitializeDSM(HashJoinState *state, ParallelContext *pcxt)
 	/* Clear any shared batch files. */
 	SharedFileSetDeleteAll(&pstate->fileset);
 
-	/* Reset build_barrier to PHJ_BUILD_ELECTING so we can go around again. */
+	/* Reset build_barrier to PHJ_BUILD_ELECT so we can go around again. */
 	BarrierInit(&pstate->build_barrier, 0);
 }
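
The deadlock-avoidance rule in the header comment (emit tuples only in phases that no process will wait on again) leans on BarrierArriveAndDetach(), declared in src/include/storage/barrier.h. As a minimal illustrative sketch, not code from this commit, the no-wait advance out of PHJ_BATCH_PROBE could be wrapped like this; leave_batch() is a hypothetical name:

```c
#include "postgres.h"
#include "storage/barrier.h"

/*
 * Hypothetical wrapper illustrating the no-wait advance described above.
 * A backend that has finished probing detaches from the batch barrier,
 * implicitly advancing PHJ_BATCH_PROBE to PHJ_BATCH_FREE without waiting.
 * BarrierArriveAndDetach() returns true only in the last participant to
 * detach, so exactly one backend learns that it is safe to free the
 * batch's shared memory; everyone else just moves on to another batch.
 */
static bool
leave_batch(Barrier *batch_barrier)
{
	/* Never wait here: a peer may still be emitting tuples from this batch. */
	return BarrierArriveAndDetach(batch_barrier);
}
```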