aboutsummaryrefslogtreecommitdiff
path: root/src/backend/executor
Commit message (Collapse)AuthorAge
* Refactor ExecScan() to allow inlining of its core logicAmit Langote2025-01-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit refactors ExecScan() by moving its tuple-fetching, filtering, and projection logic into an inline-able function, ExecScanExtended(), defined in src/include/executor/execScan.h. ExecScanExtended() accepts parameters for EvalPlanQual state, qualifiers (ExprState), and projection (ProjectionInfo). Specialized variants of the execution function of a given Scan node (for example, ExecSeqScan() for SeqScan) can then pass const-NULL for unused parameters. This allows the compiler to inline the logic and eliminate unnecessary branches or checks. Each variant function thus contains only the necessary code, optimizing execution for scans where these features are not needed. The variant function to be used is determined in the ExecInit*() function of the node and assigned to the ExecProcNode function pointer in the node's PlanState, effectively turning runtime checks and conditional branches on the NULLness of epqstate, qual, and projInfo into static ones, provided the compiler successfully eliminates unnecessary checks from the inlined code of ExecScanExtended(). Currently, only ExecSeqScan() is modified to take advantage of this inline-ability. Other Scan nodes might benefit from such specialized variant functions but that is left as future work. Benchmarks performed by Junwang Zhao, David Rowley and myself show up to a 5% reduction in execution time for queries that rely heavily on Seq Scans. The most significant improvements were observed in scenarios where EvalPlanQual, qualifiers, and projection were not required, but other cases also benefit from reduced runtime overhead due to the inlining and removal of unnecessary code paths. The idea for this patch first came from Andres Freund in an off-list discussion. The refactoring approach implemented here is based on a proposal by David Rowley, significantly improving upon the patch I (amitlan) initially proposed. Suggested-by: Andres Freund <andres@anarazel.de> Co-authored-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Junwang Zhao <zhjwpku@gmail.com> Tested-by: Junwang Zhao <zhjwpku@gmail.com> Tested-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/CA+HiwqGaH-otvqW_ce-paL=96JvU4j+Xbuk+14esJNDwefdkOg@mail.gmail.com
* Add OLD/NEW support to RETURNING in DML queries.Dean Rasheed2025-01-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This allows the RETURNING list of INSERT/UPDATE/DELETE/MERGE queries to explicitly return old and new values by using the special aliases "old" and "new", which are automatically added to the query (if not already defined) while parsing its RETURNING list, allowing things like: RETURNING old.colname, new.colname, ... RETURNING old.*, new.* Additionally, a new syntax is supported, allowing the names "old" and "new" to be changed to user-supplied alias names, e.g.: RETURNING WITH (OLD AS o, NEW AS n) o.colname, n.colname, ... This is useful when the names "old" and "new" are already defined, such as inside trigger functions, allowing backwards compatibility to be maintained -- the interpretation of any existing queries that happen to already refer to relations called "old" or "new", or use those as aliases for other relations, is not changed. For an INSERT, old values will generally be NULL, and for a DELETE, new values will generally be NULL, but that may change for an INSERT with an ON CONFLICT ... DO UPDATE clause, or if a query rewrite rule changes the command type. Therefore, we put no restrictions on the use of old and new in any DML queries. Dean Rasheed, reviewed by Jian He and Jeff Davis. Discussion: https://postgr.es/m/CAEZATCWx0J0-v=Qjc6gXzR=KtsdvAE7Ow=D=mu50AgOe+pvisQ@mail.gmail.com
* Change gist stratnum function to use CompareTypePeter Eisentraut2025-01-15
| | | | | | | | | | | | | This changes commit 7406ab623fe in that the gist strategy number mapping support function is changed to use the CompareType enum as input, instead of the "well-known" RT*StrategyNumber strategy numbers. This is a bit cleaner, since you are not dealing with two sets of strategy numbers. Also, this will enable us to subsume this system into a more general system of using CompareType to define operator semantics across index methods. Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Rename RowCompareType to CompareTypePeter Eisentraut2025-01-15
| | | | | | | | | | | | | | | | | RowCompareType served as a way to describe the fundamental meaning of an operator, notionally independent of an operator class (although so far this was only really supported for btrees). Its original purpose was for use inside RowCompareExpr, and it has also found some small use outside, such as for get_op_btree_interpretation(). We want to expand this now, as a more general way to describe operator semantics for other index access methods, including gist (to improve GistTranslateStratnum()) and others not written yet. To avoid future confusion, we rename the type to CompareType and the symbols from ROWCOMPARE_XXX to COMPARE_XXX to reflect their more general purpose. Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Revert "TupleHashTable: store additional data along with tuple."Jeff Davis2025-01-13
| | | | | | | This reverts commit e0ece2a981ee9068f50c4423e303836c2585eb02 due to performance regressions. Reported-by: David Rowley
* Fix pgindent damageRichard Guo2025-01-13
| | | | Oversight in commit e0ece2a98.
* Add support for NOT ENFORCED in CHECK constraintsPeter Eisentraut2025-01-11
| | | | | | | | | | | | | | | | | | | This adds support for the NOT ENFORCED/ENFORCED flag for constraints, with support for check constraints. The plan is to eventually support this for foreign key constraints, where it is typically more useful. Note that CHECK constraints do not currently support ALTER operations, so changing the enforceability of an existing constraint isn't possible without dropping and recreating it. This could be added later. Author: Amul Sul <amul.sul@enterprisedb.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Tested-by: Triveni N <triveni.n@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com
* Fix redefinition of type in commit e0ece2a981.Jeff Davis2025-01-10
|
* TupleHashTable: store additional data along with tuple.Jeff Davis2025-01-10
| | | | | | | | | | | | | Previously, the caller needed to allocate the memory and the TupleHashTable would store a pointer to it. That wastes space for the palloc overhead as well as the size of the pointer itself. Now, the TupleHashTable relies on the caller to correctly specify the additionalsize, and allocates that amount of space. The caller can then request a pointer into that space. Discussion: https://postgr.es/m/b9cbf0219a9859dc8d240311643ff4362fd9602c.camel@j-davis.com Reviewed-by: Heikki Linnakangas
* ExecInitAgg: update aggstate->numaggs and ->numtrans earlier.Jeff Davis2025-01-07
| | | | | | | | | Functions hash_agg_entry_size() and build_hash_tables() make use of those values for memory size estimates. Because this change only affects memory estimates, don't backpatch. Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com
* nodeSetOp.c: missing additionalsize for BuildTupleHashTable().Jeff Davis2025-01-07
| | | | | | | | Provide additionalsize argument, which can affect the calculations for 'nbuckets'. Also, future work for Hash Aggregation will rely on the correct additionalsize. Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com
* Remove unused TupleHashTableData->entrysize.Jeff Davis2025-01-07
| | | | Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com
* Fix outdated CHUNKHDRSZ value in nodeAgg.cDavid Rowley2025-01-02
| | | | | | | | | | | | | | CHUNKHDRSZ was defined as 16 bytes, which was true when that code went in, but since c6e0fe1f2, 8 is a more accurate value. Here we adjust it to use sizeof(MemoryChunk), which is normally 8, or 16 for cassert builds. c6e0fe1f2 first appeared in v16, so this is technically wrong in v16 up to master, but let's apply this only to master as adjusting this does influence the estimated number of batches in the aggregate costing code and we don't want to cause plan instability in released versions. Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAApHDvpMpRQvsTqZo3FinXkgytwxwF8sCyZm83xDj-1s_hLe+w@mail.gmail.com
* Update copyright for 2025Bruce Momjian2025-01-01
| | | | Backpatch-through: 13
* Speedup tuple deformation with additional function inliningDavid Rowley2024-12-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adjusts slot_deform_heap_tuple() to add special-case loops to eliminate much of the branching that was done within the body of the main deform loop. Previously, while looping over each attribute to deform, slot_deform_heap_tuple() would always recheck if the given attribute was NULL by looking at HeapTupleHasNulls() and if so, went on to check the tuple's NULL bitmap. Since many tuples won't contain any NULLs, we can just check HeapTupleHasNulls() once and when there are no NULLs, use a more compact version of the deforming loop which contains no NULL checking code at all. The same is possible for the "slow" mode checking part of the loop. That variable was checked several times for each attribute, once to determine if the offset to the attribute value could be taken from the attcacheoff, and again to check if the offset could be cached for next time. These "slow" checks can mostly be eliminated by instead having multiple loops. Initially, we can start in the non-slow loop and break out of that loop if and only if we must stop caching the offset. This eliminates branching for both slow and non-slow deforming methods. The amount of code required for the no nulls / non-slow version is very small. It's possible to have separate loops like this due to the fact that once we move into slow mode, we never need to switch back into non-slow mode for a given tuple. We have the compiler take care of writing out the multiple required loops by having a pg_attribute_always_inline function which gets called various times passing in constant values for the "slow" and "hasnulls" parameters. This allows the compiler to eliminate const-false branches and remove comparisons for const-true ones. This commit has shown overall query performance increases of around 5-20% in deform-heavy OLAP-type workloads. Author: David Rowley Reviewed-by: Victor Yegorov Discussion: https://postgr.es/m/CAGnEbog92Og2CpC2S8=g_HozGsWtt_3kRS1sXjLz0jKSoCNfLw@mail.gmail.com Discussion: https://postgr.es/m/CAApHDvo9e0XG71WrefYaRv5n4xNPLK4k8LjD0mSR3c9KR2vi2Q@mail.gmail.com
* Optimize alignment calculations in tuple form/deformDavid Rowley2024-12-21
| | | | | | | | | | | | | | | | | | | | | Here we convert CompactAttribute.attalign from a char, which is directly derived from pg_attribute.attalign into a uint8, which stores the number of bytes to align the column's value by in the tuple. This allows tuple deformation and tuple size calculations to move away from using the inefficient att_align_nominal() macro, which manually checks each TYPALIGN_* char to translate that into the alignment bytes for the given type. Effectively, this commit changes those to TYPEALIGN calls, which are branchless and only perform some simple arithmetic with some bit-twiddling. The removed branches were often mispredicted by CPUs, especially so in real-world tables which often contain a mishmash of different types with different alignment requirements. Author: David Rowley Reviewed-by: Andres Freund, Victor Yegorov Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
* Introduce CompactAttribute array in TupleDesc, take 2David Rowley2024-12-20
| | | | | | | | | | | | | | | | | | | | | | | | The new compact_attrs array stores a few select fields from FormData_pg_attribute in a more compact way, using only 16 bytes per column instead of the 104 bytes that FormData_pg_attribute uses. Using CompactAttribute allows performance-critical operations such as tuple deformation to be performed without looking at the FormData_pg_attribute element in TupleDesc which means fewer cacheline accesses. For some workloads, tuple deformation can be the most CPU intensive part of processing the query. Some testing with 16 columns on a table where the first column is variable length showed around a 10% increase in transactions per second for an OLAP type query performing aggregation on the 16th column. However, in certain cases, the increases were much higher, up to ~25% on one AMD Zen4 machine. This also makes pg_attribute.attcacheoff redundant. A follow-on commit will remove it, thus shrinking the FormData_pg_attribute struct by 4 bytes. Author: David Rowley Reviewed-by: Andres Freund, Victor Yegorov Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
* Get rid of old version of BuildTupleHashTable().Tom Lane2024-12-19
| | | | | | | | | | | | | | It was reasonable to preserve the old API of BuildTupleHashTable() in the back branches, but in HEAD we should actively discourage use of that version. There are no remaining callers in core, so just get rid of it. Then rename BuildTupleHashTableExt() back to BuildTupleHashTable(). While at it, fix up the miserably-poorly-maintained header comment for BuildTupleHashTable[Ext]. It looks like more than one patch in this area has had the opinion that updating comments is beneath them. Discussion: https://postgr.es/m/538343.1734646986@sss.pgh.pa.us
* Use ExecGetCommonSlotOps infrastructure in more places.Tom Lane2024-12-19
| | | | | | | | | | | | | | Append, MergeAppend, and RecursiveUnion can all use the support functions added in commit 276279295. The first two can report a fixed result slot type if all their children return the same fixed slot type. That does nothing for the append step itself, but might allow optimizations in the parent plan node. RecursiveUnion can optimize tuple hash table operations in the same way as SetOp now does. Patch by me; thanks to Richard Guo and David Rowley for review. Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
* Convert SetOp to read its inputs as outerPlan and innerPlan.Tom Lane2024-12-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The original design for set operations involved appending the two input relations into one and adding a flag column that allows distinguishing which side each row came from. Then the SetOp node pries them apart again based on the flag. This is bizarre. The only apparent reason to do it is that when sorting, we'd only need one Sort node not two. But since sorting is at least O(N log N), sorting all the data is actually worse than sorting each side separately --- plus, we have no chance of taking advantage of presorted input. On top of that, adding the flag column frequently requires an additional projection step that adds cycles, and then the Append node isn't free either. Let's get rid of all of that and make the SetOp node have two separate children, using the existing outerPlan/innerPlan infrastructure. This initial patch re-implements nodeSetop.c and does a bare minimum of work on the planner side to generate correctly-shaped plans. In particular, I've tried not to change the cost estimates here, so that the visible changes in the regression test results will only involve removal of useless projection steps and not any changes in whether to use sorted vs hashed mode. For SORTED mode, we combine successive identical tuples from each input into groups, and then merge-join the groups. The tuple comparisons now use SortSupport instead of simple equality, but the group-formation part should involve roughly the same number of tuple comparisons as before. The cross-comparisons between left and right groups probably add to that, but I'm not sure to quantify how many more comparisons we might need. For HASHED mode, nodeSetop's logic is almost the same as before, just refactored into two separate loops instead of one loop that has an assumption that it will see all the left-hand inputs first. In both modes, I added early-exit logic to not bother reading the right-hand relation if the left-hand input is empty, since neither INTERSECT nor EXCEPT modes can produce any output if the left input is empty. This could have been done before in the hashed mode, but not in sorted mode. Sorted mode can also stop as soon as it exhausts the left input; any remaining right-hand tuples cannot have matches. Also, this patch adds some infrastructure for detecting whether child plan nodes all output the same type of tuple table slot. If they do, the hash table logic can use slightly more efficient code based on assuming that that's the input slot type it will see. We'll make use of that infrastructure in other plan node types later. Patch by me; thanks to Richard Guo and David Rowley for review. Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
* Remove extra prefetch iterator setup for Bitmap Table ScanMelanie Plageman2024-12-19
| | | | | | | 1a0da347a7ac98db replaced Bitmap Table Scan's separate private and shared bitmap iterators with a unified iterator. It accidentally set up the prefetch iterator twice for non-parallel bitmap table scans. Remove the extra set up call to tbm_begin_iterate().
* Fix bitmap table scan crash on iterator releaseMelanie Plageman2024-12-19
| | | | | | | | | | | 1a0da347a7ac98db replaced Bitmap Table Scan's individual private and shared iterators with a unified iterator. It neglected, however, to check if the iterator had already been cleaned up before doing so on rescan. Add this check both on rescan and end scan to be safe. Reported-by: Richard Guo Author: Richard Guo Discussion: https://postgr.es/m/CAMbWs48nrhcLY1kcd-u9oD%2B6yiS631F_8Fx8ZGsO-BYDwH%2Bbyw%40mail.gmail.com
* Optimize grouping equality checks with virtual slotsDavid Rowley2024-12-19
| | | | | | | | | | | | | | | | | | 8f4ee9626 fixed an old Assert failure that could happen when the slot type used to look up the hash table for BuildTupleHashTableExt() users wasn't a TTSOpsMinimalTuple slot. The fix for that in the back branches had to be to pass the TupleTableSlotOps as NULL, however in master, since we have the inputOps parameter as was added by d96d1d515, we can pass that down instead. At least one caller uses a fixed slot that's always TTSOpsVirtual, so passing down inputOps for these cases allows ExecBuildGroupingEqual() to skip adding the EEOP_INNER_FETCHSOME ExprEvalStep. This should increase the performance of hashed subplans very slightly. Author: Tom Lane, David Rowley Discussion: https://postgr.es/m/2543667.1734483723@sss.pgh.pa.us
* Fix Assert failure in WITH RECURSIVE UNION queriesDavid Rowley2024-12-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the non-recursive part of a recursive CTE ended up using TTSOpsBufferHeapTuple as the table slot type, then a duplicate value could cause an Assert failure in CheckOpSlotCompatibility() when checking the hash table for the duplicate value. The expected slot type for the deform step was TTSOpsMinimalTuple so the Assert failed when the TTSOpsBufferHeapTuple slot was used. This is a long-standing bug which we likely didn't notice because it seems much more likely that the non-recursive term would have required projection and used a TTSOpsVirtual slot, which CheckOpSlotCompatibility is ok with. There doesn't seem to be any harm done here other than the Assert failure. Both TTSOpsMinimalTuple and TTSOpsBufferHeapTuple slot types require tuple deformation, so the EEOP_*_FETCHSOME ExprState step would have properly existed in the ExprState. The solution is to pass NULL for the ExecBuildGroupingEqual's 'lops' parameter. This means the ExprState's EEOP_*_FETCHSOME step won't expect a fixed slot type. This makes CheckOpSlotCompatibility() happy as no checking is performed when the ExprEvalStep is not expecting a fixed slot type. Reported-by: Richard Guo Reviewed-by: Tom Lane Discussion: https://postgr.es/m/CAMbWs4-8U9q2LAtf8+ghV11zeUReA3AmrYkxzBEv0vKnDxwkKA@mail.gmail.com Backpatch-through: 13, all supported versions
* Bitmap Table Scans use unified TBMIteratorMelanie Plageman2024-12-18
| | | | | | | | | | | | | | With the repurposing of TBMIterator as an interface for both parallel and serial iteration through TIDBitmaps in commit 7f9d4187e7bab10329cc, bitmap table scans may now use it. Modify bitmap table scan code to use the TBMIterator. This requires moving around a bit of code, so a few variables are initialized elsewhere. Author: Melanie Plageman Reviewed-by: Tomas Vondra Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me
* Add common interface for TBMIteratorsMelanie Plageman2024-12-18
| | | | | | | | | | | | | Add and use TBMPrivateIterator, which replaces the current TBMIterator for serial use cases, and repurpose TBMIterator to be a unified interface for both the serial ("private") and parallel ("shared") TID Bitmap iterator interfaces. This encapsulation simplifies call sites for callers supporting both parallel and serial TID Bitmap access. TBMIterator is not yet used in this commit. Author: Melanie Plageman Reviewed-by: Tomas Vondra, Heikki Linnakangas Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
* Fix incorrect slot type in BuildTupleHashTableExtDavid Rowley2024-12-18
| | | | | | | | | | | | | | | | | | | | | | | | | | 0f5738202 adjusted the execGrouping.c code so it made use of ExprStates to generate hash values. That commit made a wrong assumption that the slot type to pass to ExecBuildHash32FromAttrs() is always &TTSOpsMinimalTuple. That's not the case as the slot type depends on the slot type passed to LookupTupleHashEntry(), which for nodeRecursiveunion.c, could be any of the current slot types. Here we fix this by adding a new parameter to BuildTupleHashTableExt() to allow the slot type to be passed in. In the case of nodeSubplan.c and nodeAgg.c the slot type is always &TTSOpsVirtual, so for both of those cases, it's beneficial to pass the known slot type as that allows ExecBuildHash32FromAttrs() to skip adding the tuple deform step to the resulting ExprState. Another possible fix would have been to have ExecBuildHash32FromAttrs() set "fetch.kind" to NULL so that ExecComputeSlotInfo() always determines the EEOP_INNER_FETCHSOME is required, however, that option isn't favorable as slows down aggregation and hashed subplan evaluation due to the extra (needless) deform step. Thanks to Nathan Bossart for bisecting to find the offending commit based on Paul's report. Reported-by: Paul Ramsey <pramsey@cleverelephant.ca> Discussion: https://postgr.es/m/99F064C1-B3EB-4BE7-97D2-D2A0AA487A71@cleverelephant.ca
* Use ExprStates for hashing in GROUP BY and SubPlansDavid Rowley2024-12-11
| | | | | | | | | | | | | | | | | | | | | | This speeds up obtaining hash values for GROUP BY and hashed SubPlans by using the ExprState support for hashing, thus allowing JIT compilation for obtaining hash values for these operations. This, even without JIT compilation, has been shown to improve Hash Aggregate performance in some cases by around 15% and hashed NOT IN queries in one case by over 30%, however, real-world cases are likely to see smaller gains as the test cases used were purposefully designed to have high hashing overheads by keeping the hash table small to prevent additional memory overheads that would be a factor when working with large hash tables. In passing, fix a hypothetical bug in ExecBuildHash32Expr() so that the initial value is stored directly in the ExprState's result field if there are no expressions to hash. None of the current users of this function use an initial value, so the bug is only hypothetical. Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CAApHDvpYSO3kc9UryMevWqthTBrxgfd9djiAjKHMPUSQeX9vdQ@mail.gmail.com
* Speedup Hash Joins with dedicated functions for ExprState hashingDavid Rowley2024-12-11
| | | | | | | | | | | | | | | | | | | Hashing of a single Var is a very common operation for ExprState to perform. Here we add dedicated ExecJust* functions which helps speed up Hash Joins by removing the interpretation overhead in ExecInterpExpr(). This change currently only affects Hash Joins on a single column. Hash Joins with multiple join keys or an expression still run through ExecInterpExpr(). Some testing has shown up to 10% query performance increases on recent AMD hardware and nearly 7% increase on an Apple M2 for a query performing a hash join with a large number of lookups on a small hash table. This change was extracted from a larger patch which adjusts GROUP BY / hashed subplans / hashed set operations to use ExprState hashing. Discussion: https://postgr.es/m/CAApHDvr8Zc0ZgzVoCZLdHGOFNhiJeQ6vrUcS9V7N23zMWQb-eA@mail.gmail.com
* Doc: add some commentary about ExecutorRun's NoMovement special case.Tom Lane2024-12-10
| | | | | | | | | | | Robert Haas expressed concern about whether commit 3eea7a0c9 exposed the parallel-execution machinery to a case it isn't tested for, namely a second non-parallel execution of a plan after a parallel execution. Investigation shows that that can't happen because of pquery.c's manipulation of the scan direction, but it sure wasn't obvious to start with. Add some commentary about that. Discussion: https://postgr.es/m/CA+TgmoagyKQy=HFw+wLo0AKTYWHui+iKswZ8Jnqqd-cFby-WVg@mail.gmail.com
* Support for GiST in get_equal_strategy_number()Peter Eisentraut2024-12-10
| | | | | | | | | | | | | A WITHOUT OVERLAPS primary key or unique constraint is accepted as a REPLICA IDENTITY, since it guarantees uniqueness. But subscribers applying logical decoding messages would fail because there was not support for looking up the equals operator for a gist index. This fixes that: For GiST indexes we can use the stratnum GiST support function. Reviewed-by: Paul Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Replace get_equal_strategy_number_for_am() by get_equal_strategy_number()Peter Eisentraut2024-12-10
| | | | | | | | | | | | | | | | | get_equal_strategy_number_for_am() gets the equal strategy number for an AM. This currently only supports btree and hash. In the more general case, this also depends on the operator class (see for example GistTranslateStratnum()). To support that, replace this function with get_equal_strategy_number() that takes an opclass and derives it from there. (This function already existed before as a static function, so the signature is kept for simplicity.) This patch is only a refactoring, it doesn't add support for other index AMs such as gist. This will be done separately. Reviewed-by: Paul Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: vignesh C <vignesh21@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Improve internal logical replication error for missing equality strategyPeter Eisentraut2024-12-10
| | | | | | | | | | | This "shouldn't happen", except right now it can with a temporal gist index (to be fixed soon), because of missing gist support in get_equal_strategy_number(). But right now, the error is not caught right away, but instead you get the subsequent error about a "missing operator 0". This makes the error more accurate. Author: Paul Jungwirth <pj@illuminatedcomputing.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Simplify executor's determination of whether to use parallelism.Tom Lane2024-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our parallel-mode code only works when we are executing a query in full, so ExecutePlan must disable parallel mode when it is asked to do partial execution. The previous logic for this involved passing down a flag (variously named execute_once or run_once) from callers of ExecutorRun or PortalRun. This is overcomplicated, and unsurprisingly some of the callers didn't get it right, since it requires keeping state that not all of them have handy; not to mention that the requirements for it were undocumented. That led to assertion failures in some corner cases. The only state we really need for this is the existing QueryDesc.already_executed flag, so let's just put all the responsibility in ExecutePlan. (It could have been done in ExecutorRun too, leading to a slightly shorter patch -- but if there's ever more than one caller of ExecutePlan, it seems better to have this logic in the subroutine than the callers.) This makes those ExecutorRun/PortalRun parameters unnecessary. In master it seems okay to just remove them, returning the API for those functions to what it was before parallelism. Such an API break is clearly not okay in stable branches, but for them we can just leave the parameters in place after documenting that they do nothing. Per report from Yugo Nagata, who also reviewed and tested this patch. Back-patch to all supported branches. Discussion: https://postgr.es/m/20241206062549.710dc01cf91224809dd6c0e1@sraoss.co.jp
* Fix right-semi-joins in HashJoin rescansRichard Guo2024-12-09
| | | | | | | | | | | | | | | | | When resetting a HashJoin node for rescans, if it is a single-batch join and there are no parameter changes for the inner subnode, we can just reuse the existing hash table without rebuilding it. However, for join types that depend on the inner-tuple match flags in the hash table, we need to reset these match flags to avoid incorrect results. This applies to right, right-anti, right-semi, and full joins. When I introduced "Right Semi Join" plan shapes in aa86129e1, I failed to reset the match flags in the hash table for right-semi joins in rescans. This oversight has been shown to produce incorrect results. This patch fixes it. Author: Richard Guo Discussion: https://postgr.es/m/CAMbWs4-nQF9io2WL2SkD0eXvfPdyBc9Q=hRwfQHCGV2usa0jyA@mail.gmail.com
* Improve the error message introduced in commit 87ce27de696.Amit Kapila2024-12-09
| | | | | | | | | | | The error detail message "Replica identity consists of an unpublished generated column." implies that the entire replica identity is made up of an unpublished generated column which may not be the case. Reported-by: Peter Smith Author: Shlok Kyal Reviewed-by: Peter Smith, Amit Kapila Discussion: https://postgr.es/m/CAHut+PuwMhKx0PhOA4APhJTLoBGNykbeCQpr_CuwGT-SkswG5w@mail.gmail.com
* Fix possible crash during WindowAgg evaluationDavid Rowley2024-12-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When short-circuiting WindowAgg node evaluation on the top-level WindowAgg node using quals on monotonic window functions, because the WindowAgg run condition can mean there's no need to evaluate subsequent window function results in the same partition once the run condition becomes false, it was possible that the executor would use stale results from the previous invocation of the window function in some cases. A fix for this was partially done by a5832722, but that commit only fixed the issue for non-top-level WindowAgg nodes. I mistakenly thought that the top-level WindowAgg didn't have this issue, but Jayesh's example case clearly shows that's incorrect. At the time, I also thought that this only affected 32-bit systems as all window functions which then supported run conditions returned BIGINT, however, that's wrong as ExecProject is still called and that could cause evaluation of any other window function belonging to the same WindowAgg node, one of which may return a byref type. The only queries affected by this are WindowAggs with a "Run Condition" which contains at least one window function with a byref result type, such as lead() or lag() on a byref column. The window clause must also contain a PARTITION BY clause (without a PARTITION BY, execution of the WindowAgg stops immediately when the run condition becomes false and there's no risk of using the stale results). Reported-by: Jayesh Dehankar Discussion: https://postgr.es/m/193261e2c4d.3dd3cd7c1842.871636075166132237@zohocorp.com Backpatch-through: 15, where WindowAgg run conditions were added
* Ensure stored generated columns must be published when required.Amit Kapila2024-12-04
| | | | | | | | | | | | | | | | | | | | | | | Ensure stored generated columns that are part of REPLICA IDENTITY must be published explicitly for UPDATE and DELETE operations to be published. We can publish generated columns by listing them in the column list or by enabling the publish_generated_columns option. This commit changes the behavior of the test added in commit adedf54e65 by giving an ERROR for the UPDATE operation in such cases. There is no way to trigger the bug reported in commit adedf54e65 but we didn't remove the corresponding code change because it is still relevant when replicating changes from a publisher with version less than 18. We decided not to backpatch this behavior change to avoid the risk of breaking existing output plugins that may be sending generated columns by default although we are not aware of any such plugin. Also, we didn't see any reports related to this on STABLE branches which is another reason not to backpatch this change. Author: Shlok Kyal, Hou Zhijie Reviewed-by: Vignesh C, Amit Kapila Discussion: https://postgr.es/m/CANhcyEVw4V2Awe2AB6i0E5AJLNdASShGfdBLbUd1XtWDboymCA@mail.gmail.com
* Revert "Introduce CompactAttribute array in TupleDesc"David Rowley2024-12-03
| | | | | | | | | | This reverts commit d28dff3f6cd6a7562fb2c211ac0fb74a33ffd032. Quite a large number of buildfarm members didn't like this commit and it's not yet clear why. Reverting this before too many animals turn red. Discussion: https://postgr.es/m/CAApHDvr9i6T5=iAwQCxFDgMsthr_obVxgwBaEJkC8KUH6yM3Hw@mail.gmail.com
* Introduce CompactAttribute array in TupleDescDavid Rowley2024-12-03
| | | | | | | | | | | | | | | | | | | | | | | | | | The new compact_attrs array stores a few select fields from FormData_pg_attribute in a more compact way, using only 16 bytes per column instead of the 104 bytes that FormData_pg_attribute uses. Using CompactAttribute allows performance-critical operations such as tuple deformation to be performed without looking at the FormData_pg_attribute element in TupleDesc which means fewer cacheline accesses. With this change, NAMEDATALEN could be increased with a much smaller negative impact on performance. For some workloads, tuple deformation can be the most CPU intensive part of processing the query. Some testing with 16 columns on a table where the first column is variable length showed around a 10% increase in transactions per second for an OLAP type query performing aggregation on the 16th column. However, in certain cases, the increases were much higher, up to ~25% on one AMD Zen4 machine. This also makes pg_attribute.attcacheoff redundant. A follow-on commit will remove it, thus shrinking the FormData_pg_attribute struct by 4 bytes. Author: David Rowley Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com Reviewed-by: Andres Freund, Victor Yegorov
* Remove useless casts to (void *)Peter Eisentraut2024-11-28
| | | | | | | | Many of them just seem to have been copied around for no real reason. Their presence causes (small) risks of hiding actual type mismatches or silently discarding qualifiers Discussion: https://www.postgresql.org/message-id/flat/461ea37c-8b58-43b4-9736-52884e862820@eisentraut.org
* Fix NULLIF()'s handling of read-write expanded objects.Tom Lane2024-11-25
| | | | | | | | | | | | | | | | | | | | | | | If passed a read-write expanded object pointer, the EEOP_NULLIF code would hand that same pointer to the equality function and then (unless equality was reported) also return the same pointer as its value. This is no good, because a function that receives a read-write expanded object pointer is fully entitled to scribble on or even delete the object, thus corrupting the NULLIF output. (This problem is likely unobservable with the equality functions provided in core Postgres, but it's easy to demonstrate with one coded in plpgsql.) To fix, make sure the pointer passed to the equality function is read-only. We can still return the original read-write pointer as the NULLIF result, allowing optimization of later operations. Per bug #18722 from Alexander Lakhin. This has been wrong since we invented expanded objects, so back-patch to all supported branches. Discussion: https://postgr.es/m/18722-fd9e645448cc78b4@postgresql.org
* Ensure cached plans are correctly marked as dependent on role.Nathan Bossart2024-11-11
| | | | | | | | | | | | | | If a CTE, subquery, sublink, security invoker view, or coercion projection references a table with row-level security policies, we neglected to mark the plan as potentially dependent on which role is executing it. This could lead to later executions in the same session returning or hiding rows that should have been hidden or returned instead. Reported-by: Wolfgang Walther Reviewed-by: Noah Misch Security: CVE-2024-10976 Backpatch-through: 12
* Add two attributes to pg_stat_database for parallel workers activityMichael Paquier2024-11-11
| | | | | | | | | | | | | | | | | | | | | | Two attributes are added to pg_stat_database: * parallel_workers_to_launch, counting the total number of parallel workers that were planned to be launched. * parallel_workers_launched, counting the total number of parallel workers actually launched. The ratio of both fields can provide hints that there are not enough slots available when launching parallel workers, also useful when pg_stat_statements is not deployed on an instance (i.e. cf54a2c00254). This commit relies on de3a2ea3b264, that has added two fields to EState, that get incremented when executing Gather or GatherMerge nodes. A test is added in select_parallel, where parallel workers are spawned. Bump catalog version. Author: Benoit Lobréau Discussion: https://postgr.es/m/783bc7f7-659a-42fa-99dd-ee0565644e25@dalibo.com
* Improve fix for not entering parallel mode when holding interrupts.Tom Lane2024-11-08
| | | | | | | | | | | | | | | | | | | | | | | Commit ac04aa84a put the shutoff for this into the planner, which is not ideal because it doesn't prevent us from re-using a previously made parallel plan. Revert the planner change and instead put the shutoff into InitializeParallelDSM, modeling it on the existing code there for recovering from failure to allocate a DSM segment. However, that code path is mostly untested, and testing a bit harder showed there's at least one bug: ExecHashJoinReInitializeDSM is not prepared for us to have skipped doing parallel DSM setup. I also thought the Assert in ReinitializeParallelWorkers is pretty ill-advised, and replaced it with a silent Min() operation. The existing test case added by ac04aa84a serves fine to test this version of the fix, so no change needed there. Patch by me, but thanks to Noah Misch for the core idea that we could shut off worker creation when !INTERRUPTS_CAN_BE_PROCESSED. Back-patch to v12, as ac04aa84a was. Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com
* Fix hypothetical bug in ExprState building for hashingDavid Rowley2024-11-06
| | | | | | | | | | | | | | | | | | | adf97c156 gave ExprStates the ability to hash expressions and return a single hash value. That commit supports seeding the hash value with an initial value to have that blended into the final hash value. Here we fix a hypothetical bug where if there are zero expressions to hash, the initial value is stored in the wrong location. The existing code stored the initial value in an intermediate location expecting that when the expressions were hashed that those steps would store the final hash value in the ExprState.resvalue field. However, that wouldn't happen when there are zero expressions to hash. The correct thing to do instead is to have a special case for zero expressions and when we hit that case, store the initial value directly in the ExprState.resvalue. The reason that this is a hypothetical bug is that no code currently calls ExecBuildHash32Expr passing a non-zero initial value. Discussion: https://postgr.es/m/CAApHDvpMAL_zxbMRr1LOex3O7Y7R7ZN2i8iUFLQhqQiJMAg3qw@mail.gmail.com
* Remove duplicate words in commentsDaniel Gustafsson2024-10-31
| | | | | | | | A few comments contained duplicate "the" in sentences, fix by removing one occurrence. Author: Vignesh C <vignesh21@gmail.com> Discussion: https://postgr.es/m/CALDaNm2aEEiPwGJmPdzBxROVvs8n75yCjKz4K1f1B2TdWpzxTA@mail.gmail.com
* Remove unused field from SubPlanState structDavid Rowley2024-10-31
| | | | | | | | | | | | | | | | bf6c614a2 did some conversion work to use ExprState instead of manually calling equality functions to check if one set of values is not distinct from another set. That patch removed many of the fields that became redundant as a result of that change, but it forgot to remove SubPlanState.tab_eq_funcs. Fix that. In passing, fix the header comment for TupleHashEntryData to correctly spell the field name it's talking about. Author: Rafia Sabih <rafia.pghackers@gmail.com> Reviewed-by: Andrei Lepikhov <lepihov@gmail.com> Discussion: https://postgr.es/m/CA+FpmFeycdombFzrjZw7Rmc29CVm4OOzCWwu=dVBQ6q=PX8SvQ@mail.gmail.com Discussion: https://postgr.es/m/CAApHDvrWR2jYVhec=COyF2g2BE_ns91NDsCHAMFiXbyhEujKdQ@mail.gmail.com
* Remove unused #include's from backend .c filesPeter Eisentraut2024-10-27
| | | | | | | | as determined by IWYU These are mostly issues that are new since commit dbbca2cf299. Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org
* Make table_scan_bitmap_next_block() async-friendlyMelanie Plageman2024-10-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move all responsibility for indicating a block is exhuasted into table_scan_bitmap_next_tuple() and advance the main iterator in heap-specific code. This flow control makes more sense and is a step toward using the read stream API for bitmap heap scans. Previously, table_scan_bitmap_next_block() returned false to indicate table_scan_bitmap_next_tuple() should not be called for the tuples on the page. This happened both when 1) there were no visible tuples on the page and 2) when the block returned by the iterator was past the end of the table. BitmapHeapNext() (generic bitmap table scan code) handled the case when the bitmap was exhausted. It makes more sense for table_scan_bitmap_next_tuple() to return false when there are no visible tuples on the page and table_scan_bitmap_next_block() to return false when the bitmap is exhausted or there are no more blocks in the table. As part of this new design, TBMIterateResults are no longer used as a flow control mechanism in BitmapHeapNext(), so we removed table_scan_bitmap_next_tuple's TBMIterateResult parameter. Note that the prefetch iterator is still saved in the BitmapHeapScanState node and advanced in generic bitmap table scan code. This is because 1) it was not necessary to change the prefetch iterator location to change the flow control in BitmapHeapNext() 2) modifying prefetch iterator management requires several more steps better split over multiple commits and 3) the prefetch iterator will be removed once the read stream API is used. Author: Melanie Plageman Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas, Mark Dilger Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi