| Commit message (Collapse) | Author | Age |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit refactors ExecScan() by moving its tuple-fetching,
filtering, and projection logic into an inline-able function,
ExecScanExtended(), defined in src/include/executor/execScan.h.
ExecScanExtended() accepts parameters for EvalPlanQual state,
qualifiers (ExprState), and projection (ProjectionInfo).
Specialized variants of the execution function of a given Scan node
(for example, ExecSeqScan() for SeqScan) can then pass const-NULL for
unused parameters. This allows the compiler to inline the logic and
eliminate unnecessary branches or checks. Each variant function thus
contains only the necessary code, optimizing execution for scans
where these features are not needed.
The variant function to be used is determined in the ExecInit*()
function of the node and assigned to the ExecProcNode function pointer
in the node's PlanState, effectively turning runtime checks and
conditional branches on the NULLness of epqstate, qual, and projInfo
into static ones, provided the compiler successfully eliminates
unnecessary checks from the inlined code of ExecScanExtended().
Currently, only ExecSeqScan() is modified to take advantage of this
inline-ability. Other Scan nodes might benefit from such specialized
variant functions but that is left as future work.
Benchmarks performed by Junwang Zhao, David Rowley and myself show up
to a 5% reduction in execution time for queries that rely heavily on
Seq Scans. The most significant improvements were observed in
scenarios where EvalPlanQual, qualifiers, and projection were not
required, but other cases also benefit from reduced runtime overhead
due to the inlining and removal of unnecessary code paths.
The idea for this patch first came from Andres Freund in an off-list
discussion. The refactoring approach implemented here is based on a
proposal by David Rowley, significantly improving upon the patch I
(amitlan) initially proposed.
Suggested-by: Andres Freund <andres@anarazel.de>
Co-authored-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Tested-by: Junwang Zhao <zhjwpku@gmail.com>
Tested-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CA+HiwqGaH-otvqW_ce-paL=96JvU4j+Xbuk+14esJNDwefdkOg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows the RETURNING list of INSERT/UPDATE/DELETE/MERGE queries
to explicitly return old and new values by using the special aliases
"old" and "new", which are automatically added to the query (if not
already defined) while parsing its RETURNING list, allowing things
like:
RETURNING old.colname, new.colname, ...
RETURNING old.*, new.*
Additionally, a new syntax is supported, allowing the names "old" and
"new" to be changed to user-supplied alias names, e.g.:
RETURNING WITH (OLD AS o, NEW AS n) o.colname, n.colname, ...
This is useful when the names "old" and "new" are already defined,
such as inside trigger functions, allowing backwards compatibility to
be maintained -- the interpretation of any existing queries that
happen to already refer to relations called "old" or "new", or use
those as aliases for other relations, is not changed.
For an INSERT, old values will generally be NULL, and for a DELETE,
new values will generally be NULL, but that may change for an INSERT
with an ON CONFLICT ... DO UPDATE clause, or if a query rewrite rule
changes the command type. Therefore, we put no restrictions on the use
of old and new in any DML queries.
Dean Rasheed, reviewed by Jian He and Jeff Davis.
Discussion: https://postgr.es/m/CAEZATCWx0J0-v=Qjc6gXzR=KtsdvAE7Ow=D=mu50AgOe+pvisQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes commit 7406ab623fe in that the gist strategy number
mapping support function is changed to use the CompareType enum as
input, instead of the "well-known" RT*StrategyNumber strategy numbers.
This is a bit cleaner, since you are not dealing with two sets of
strategy numbers. Also, this will enable us to subsume this system
into a more general system of using CompareType to define operator
semantics across index methods.
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
RowCompareType served as a way to describe the fundamental meaning of
an operator, notionally independent of an operator class (although so
far this was only really supported for btrees). Its original purpose
was for use inside RowCompareExpr, and it has also found some small
use outside, such as for get_op_btree_interpretation().
We want to expand this now, as a more general way to describe operator
semantics for other index access methods, including gist (to improve
GistTranslateStratnum()) and others not written yet. To avoid future
confusion, we rename the type to CompareType and the symbols from
ROWCOMPARE_XXX to COMPARE_XXX to reflect their more general purpose.
Reviewed-by: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
|
|
|
|
|
| |
This reverts commit e0ece2a981ee9068f50c4423e303836c2585eb02 due to
performance regressions.
Reported-by: David Rowley
|
|
|
|
| |
Oversight in commit e0ece2a98.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds support for the NOT ENFORCED/ENFORCED flag for constraints,
with support for check constraints.
The plan is to eventually support this for foreign key constraints,
where it is typically more useful.
Note that CHECK constraints do not currently support ALTER operations,
so changing the enforceability of an existing constraint isn't
possible without dropping and recreating it. This could be added
later.
Author: Amul Sul <amul.sul@enterprisedb.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: jian he <jian.universality@gmail.com>
Tested-by: Triveni N <triveni.n@enterprisedb.com>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA@mail.gmail.com
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, the caller needed to allocate the memory and the
TupleHashTable would store a pointer to it. That wastes space for the
palloc overhead as well as the size of the pointer itself.
Now, the TupleHashTable relies on the caller to correctly specify the
additionalsize, and allocates that amount of space. The caller can
then request a pointer into that space.
Discussion: https://postgr.es/m/b9cbf0219a9859dc8d240311643ff4362fd9602c.camel@j-davis.com
Reviewed-by: Heikki Linnakangas
|
|
|
|
|
|
|
|
|
| |
Functions hash_agg_entry_size() and build_hash_tables() make use of
those values for memory size estimates.
Because this change only affects memory estimates, don't backpatch.
Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com
|
|
|
|
|
|
|
|
| |
Provide additionalsize argument, which can affect the calculations for
'nbuckets'. Also, future work for Hash Aggregation will rely on the
correct additionalsize.
Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com
|
|
|
|
| |
Discussion: https://postgr.es/m/7530bd8783b1a78d53a3c70383e38d8da0a5ffe5.camel%40j-davis.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
CHUNKHDRSZ was defined as 16 bytes, which was true when that code went in,
but since c6e0fe1f2, 8 is a more accurate value. Here we adjust it to use
sizeof(MemoryChunk), which is normally 8, or 16 for cassert builds.
c6e0fe1f2 first appeared in v16, so this is technically wrong in v16 up
to master, but let's apply this only to master as adjusting this does
influence the estimated number of batches in the aggregate costing code
and we don't want to cause plan instability in released versions.
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAApHDvpMpRQvsTqZo3FinXkgytwxwF8sCyZm83xDj-1s_hLe+w@mail.gmail.com
|
|
|
|
| |
Backpatch-through: 13
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adjusts slot_deform_heap_tuple() to add special-case loops to
eliminate much of the branching that was done within the body of the
main deform loop.
Previously, while looping over each attribute to deform,
slot_deform_heap_tuple() would always recheck if the given attribute was
NULL by looking at HeapTupleHasNulls() and if so, went on to check the
tuple's NULL bitmap. Since many tuples won't contain any NULLs, we can
just check HeapTupleHasNulls() once and when there are no NULLs, use a
more compact version of the deforming loop which contains no NULL checking
code at all.
The same is possible for the "slow" mode checking part of the loop. That
variable was checked several times for each attribute, once to determine
if the offset to the attribute value could be taken from the attcacheoff,
and again to check if the offset could be cached for next time.
These "slow" checks can mostly be eliminated by instead having multiple
loops. Initially, we can start in the non-slow loop and break out of
that loop if and only if we must stop caching the offset. This
eliminates branching for both slow and non-slow deforming methods. The
amount of code required for the no nulls / non-slow version is very
small. It's possible to have separate loops like this due to the fact
that once we move into slow mode, we never need to switch back into
non-slow mode for a given tuple.
We have the compiler take care of writing out the multiple required
loops by having a pg_attribute_always_inline function which gets called
various times passing in constant values for the "slow" and "hasnulls"
parameters. This allows the compiler to eliminate const-false branches
and remove comparisons for const-true ones.
This commit has shown overall query performance increases of around 5-20%
in deform-heavy OLAP-type workloads.
Author: David Rowley
Reviewed-by: Victor Yegorov
Discussion: https://postgr.es/m/CAGnEbog92Og2CpC2S8=g_HozGsWtt_3kRS1sXjLz0jKSoCNfLw@mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvo9e0XG71WrefYaRv5n4xNPLK4k8LjD0mSR3c9KR2vi2Q@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Here we convert CompactAttribute.attalign from a char, which is directly
derived from pg_attribute.attalign into a uint8, which stores the number
of bytes to align the column's value by in the tuple.
This allows tuple deformation and tuple size calculations to move away
from using the inefficient att_align_nominal() macro, which manually
checks each TYPALIGN_* char to translate that into the alignment bytes
for the given type. Effectively, this commit changes those to TYPEALIGN
calls, which are branchless and only perform some simple arithmetic with
some bit-twiddling.
The removed branches were often mispredicted by CPUs, especially so in
real-world tables which often contain a mishmash of different types
with different alignment requirements.
Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses. Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses.
For some workloads, tuple deformation can be the most CPU intensive part
of processing the query. Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column. However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.
This also makes pg_attribute.attcacheoff redundant. A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.
Author: David Rowley
Reviewed-by: Andres Freund, Victor Yegorov
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It was reasonable to preserve the old API of BuildTupleHashTable()
in the back branches, but in HEAD we should actively discourage use
of that version. There are no remaining callers in core, so just
get rid of it. Then rename BuildTupleHashTableExt() back to
BuildTupleHashTable().
While at it, fix up the miserably-poorly-maintained header comment
for BuildTupleHashTable[Ext]. It looks like more than one patch in
this area has had the opinion that updating comments is beneath them.
Discussion: https://postgr.es/m/538343.1734646986@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Append, MergeAppend, and RecursiveUnion can all use the support
functions added in commit 276279295. The first two can report a
fixed result slot type if all their children return the same fixed
slot type. That does nothing for the append step itself, but might
allow optimizations in the parent plan node. RecursiveUnion can
optimize tuple hash table operations in the same way as SetOp now
does.
Patch by me; thanks to Richard Guo and David Rowley for review.
Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The original design for set operations involved appending the two
input relations into one and adding a flag column that allows
distinguishing which side each row came from. Then the SetOp node
pries them apart again based on the flag. This is bizarre. The
only apparent reason to do it is that when sorting, we'd only need
one Sort node not two. But since sorting is at least O(N log N),
sorting all the data is actually worse than sorting each side
separately --- plus, we have no chance of taking advantage of
presorted input. On top of that, adding the flag column frequently
requires an additional projection step that adds cycles, and then
the Append node isn't free either. Let's get rid of all of that
and make the SetOp node have two separate children, using the
existing outerPlan/innerPlan infrastructure.
This initial patch re-implements nodeSetop.c and does a bare minimum
of work on the planner side to generate correctly-shaped plans.
In particular, I've tried not to change the cost estimates here,
so that the visible changes in the regression test results will only
involve removal of useless projection steps and not any changes in
whether to use sorted vs hashed mode.
For SORTED mode, we combine successive identical tuples from each
input into groups, and then merge-join the groups. The tuple
comparisons now use SortSupport instead of simple equality, but
the group-formation part should involve roughly the same number of
tuple comparisons as before. The cross-comparisons between left and
right groups probably add to that, but I'm not sure to quantify how
many more comparisons we might need.
For HASHED mode, nodeSetop's logic is almost the same as before,
just refactored into two separate loops instead of one loop that
has an assumption that it will see all the left-hand inputs first.
In both modes, I added early-exit logic to not bother reading the
right-hand relation if the left-hand input is empty, since neither
INTERSECT nor EXCEPT modes can produce any output if the left input
is empty. This could have been done before in the hashed mode, but
not in sorted mode. Sorted mode can also stop as soon as it exhausts
the left input; any remaining right-hand tuples cannot have matches.
Also, this patch adds some infrastructure for detecting whether
child plan nodes all output the same type of tuple table slot.
If they do, the hash table logic can use slightly more efficient
code based on assuming that that's the input slot type it will see.
We'll make use of that infrastructure in other plan node types later.
Patch by me; thanks to Richard Guo and David Rowley for review.
Discussion: https://postgr.es/m/1850138.1731549611@sss.pgh.pa.us
|
|
|
|
|
|
|
| |
1a0da347a7ac98db replaced Bitmap Table Scan's separate private and
shared bitmap iterators with a unified iterator. It accidentally set up
the prefetch iterator twice for non-parallel bitmap table scans. Remove
the extra set up call to tbm_begin_iterate().
|
|
|
|
|
|
|
|
|
|
|
| |
1a0da347a7ac98db replaced Bitmap Table Scan's individual private and
shared iterators with a unified iterator. It neglected, however, to
check if the iterator had already been cleaned up before doing so on
rescan. Add this check both on rescan and end scan to be safe.
Reported-by: Richard Guo
Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs48nrhcLY1kcd-u9oD%2B6yiS631F_8Fx8ZGsO-BYDwH%2Bbyw%40mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
8f4ee9626 fixed an old Assert failure that could happen when the slot
type used to look up the hash table for BuildTupleHashTableExt() users
wasn't a TTSOpsMinimalTuple slot. The fix for that in the back branches
had to be to pass the TupleTableSlotOps as NULL, however in master,
since we have the inputOps parameter as was added by d96d1d515, we can
pass that down instead.
At least one caller uses a fixed slot that's always TTSOpsVirtual, so
passing down inputOps for these cases allows ExecBuildGroupingEqual() to
skip adding the EEOP_INNER_FETCHSOME ExprEvalStep.
This should increase the performance of hashed subplans very slightly.
Author: Tom Lane, David Rowley
Discussion: https://postgr.es/m/2543667.1734483723@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the non-recursive part of a recursive CTE ended up using
TTSOpsBufferHeapTuple as the table slot type, then a duplicate value
could cause an Assert failure in CheckOpSlotCompatibility() when
checking the hash table for the duplicate value. The expected slot type
for the deform step was TTSOpsMinimalTuple so the Assert failed when the
TTSOpsBufferHeapTuple slot was used.
This is a long-standing bug which we likely didn't notice because it
seems much more likely that the non-recursive term would have required
projection and used a TTSOpsVirtual slot, which CheckOpSlotCompatibility
is ok with.
There doesn't seem to be any harm done here other than the Assert
failure. Both TTSOpsMinimalTuple and TTSOpsBufferHeapTuple slot types
require tuple deformation, so the EEOP_*_FETCHSOME ExprState step would
have properly existed in the ExprState.
The solution is to pass NULL for the ExecBuildGroupingEqual's 'lops'
parameter. This means the ExprState's EEOP_*_FETCHSOME step won't
expect a fixed slot type. This makes CheckOpSlotCompatibility() happy as
no checking is performed when the ExprEvalStep is not expecting a fixed
slot type.
Reported-by: Richard Guo
Reviewed-by: Tom Lane
Discussion: https://postgr.es/m/CAMbWs4-8U9q2LAtf8+ghV11zeUReA3AmrYkxzBEv0vKnDxwkKA@mail.gmail.com
Backpatch-through: 13, all supported versions
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With the repurposing of TBMIterator as an interface for both parallel
and serial iteration through TIDBitmaps in commit 7f9d4187e7bab10329cc,
bitmap table scans may now use it.
Modify bitmap table scan code to use the TBMIterator. This requires
moving around a bit of code, so a few variables are initialized
elsewhere.
Author: Melanie Plageman
Reviewed-by: Tomas Vondra
Discussion: https://postgr.es/m/c736f6aa-8b35-4e20-9621-62c7c82e2168%40vondra.me
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add and use TBMPrivateIterator, which replaces the current TBMIterator
for serial use cases, and repurpose TBMIterator to be a unified
interface for both the serial ("private") and parallel ("shared") TID
Bitmap iterator interfaces. This encapsulation simplifies call sites for
callers supporting both parallel and serial TID Bitmap access.
TBMIterator is not yet used in this commit.
Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Heikki Linnakangas
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
0f5738202 adjusted the execGrouping.c code so it made use of ExprStates to
generate hash values. That commit made a wrong assumption that the slot
type to pass to ExecBuildHash32FromAttrs() is always &TTSOpsMinimalTuple.
That's not the case as the slot type depends on the slot type passed to
LookupTupleHashEntry(), which for nodeRecursiveunion.c, could be any of
the current slot types.
Here we fix this by adding a new parameter to BuildTupleHashTableExt()
to allow the slot type to be passed in. In the case of nodeSubplan.c
and nodeAgg.c the slot type is always &TTSOpsVirtual, so for both of
those cases, it's beneficial to pass the known slot type as that allows
ExecBuildHash32FromAttrs() to skip adding the tuple deform step to the
resulting ExprState. Another possible fix would have been to have
ExecBuildHash32FromAttrs() set "fetch.kind" to NULL so that
ExecComputeSlotInfo() always determines the EEOP_INNER_FETCHSOME is
required, however, that option isn't favorable as slows down aggregation
and hashed subplan evaluation due to the extra (needless) deform step.
Thanks to Nathan Bossart for bisecting to find the offending commit
based on Paul's report.
Reported-by: Paul Ramsey <pramsey@cleverelephant.ca>
Discussion: https://postgr.es/m/99F064C1-B3EB-4BE7-97D2-D2A0AA487A71@cleverelephant.ca
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This speeds up obtaining hash values for GROUP BY and hashed SubPlans by
using the ExprState support for hashing, thus allowing JIT compilation for
obtaining hash values for these operations.
This, even without JIT compilation, has been shown to improve Hash
Aggregate performance in some cases by around 15% and hashed NOT IN
queries in one case by over 30%, however, real-world cases are likely to
see smaller gains as the test cases used were purposefully designed to
have high hashing overheads by keeping the hash table small to prevent
additional memory overheads that would be a factor when working with large
hash tables.
In passing, fix a hypothetical bug in ExecBuildHash32Expr() so that the
initial value is stored directly in the ExprState's result field if
there are no expressions to hash. None of the current users of this
function use an initial value, so the bug is only hypothetical.
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CAApHDvpYSO3kc9UryMevWqthTBrxgfd9djiAjKHMPUSQeX9vdQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Hashing of a single Var is a very common operation for ExprState to
perform. Here we add dedicated ExecJust* functions which helps speed up
Hash Joins by removing the interpretation overhead in ExecInterpExpr().
This change currently only affects Hash Joins on a single column. Hash
Joins with multiple join keys or an expression still run through
ExecInterpExpr().
Some testing has shown up to 10% query performance increases on recent AMD
hardware and nearly 7% increase on an Apple M2 for a query performing a
hash join with a large number of lookups on a small hash table.
This change was extracted from a larger patch which adjusts GROUP BY /
hashed subplans / hashed set operations to use ExprState hashing.
Discussion: https://postgr.es/m/CAApHDvr8Zc0ZgzVoCZLdHGOFNhiJeQ6vrUcS9V7N23zMWQb-eA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
| |
Robert Haas expressed concern about whether commit 3eea7a0c9 exposed
the parallel-execution machinery to a case it isn't tested for, namely
a second non-parallel execution of a plan after a parallel execution.
Investigation shows that that can't happen because of pquery.c's
manipulation of the scan direction, but it sure wasn't obvious to
start with. Add some commentary about that.
Discussion: https://postgr.es/m/CA+TgmoagyKQy=HFw+wLo0AKTYWHui+iKswZ8Jnqqd-cFby-WVg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A WITHOUT OVERLAPS primary key or unique constraint is accepted as a
REPLICA IDENTITY, since it guarantees uniqueness. But subscribers
applying logical decoding messages would fail because there was not
support for looking up the equals operator for a gist index. This
fixes that: For GiST indexes we can use the stratnum GiST support
function.
Reviewed-by: Paul Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
get_equal_strategy_number_for_am() gets the equal strategy number for
an AM. This currently only supports btree and hash. In the more
general case, this also depends on the operator class (see for example
GistTranslateStratnum()). To support that, replace this function with
get_equal_strategy_number() that takes an opclass and derives it from
there. (This function already existed before as a static function, so
the signature is kept for simplicity.)
This patch is only a refactoring, it doesn't add support for other
index AMs such as gist. This will be done separately.
Reviewed-by: Paul Jungwirth <pj@illuminatedcomputing.com>
Reviewed-by: vignesh C <vignesh21@gmail.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
| |
This "shouldn't happen", except right now it can with a temporal gist
index (to be fixed soon), because of missing gist support in
get_equal_strategy_number(). But right now, the error is not caught
right away, but instead you get the subsequent error about a "missing
operator 0". This makes the error more accurate.
Author: Paul Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Our parallel-mode code only works when we are executing a query
in full, so ExecutePlan must disable parallel mode when it is
asked to do partial execution. The previous logic for this
involved passing down a flag (variously named execute_once or
run_once) from callers of ExecutorRun or PortalRun. This is
overcomplicated, and unsurprisingly some of the callers didn't
get it right, since it requires keeping state that not all of
them have handy; not to mention that the requirements for it were
undocumented. That led to assertion failures in some corner
cases. The only state we really need for this is the existing
QueryDesc.already_executed flag, so let's just put all the
responsibility in ExecutePlan. (It could have been done in
ExecutorRun too, leading to a slightly shorter patch -- but if
there's ever more than one caller of ExecutePlan, it seems better
to have this logic in the subroutine than the callers.)
This makes those ExecutorRun/PortalRun parameters unnecessary.
In master it seems okay to just remove them, returning the
API for those functions to what it was before parallelism.
Such an API break is clearly not okay in stable branches,
but for them we can just leave the parameters in place after
documenting that they do nothing.
Per report from Yugo Nagata, who also reviewed and tested
this patch. Back-patch to all supported branches.
Discussion: https://postgr.es/m/20241206062549.710dc01cf91224809dd6c0e1@sraoss.co.jp
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When resetting a HashJoin node for rescans, if it is a single-batch
join and there are no parameter changes for the inner subnode, we can
just reuse the existing hash table without rebuilding it. However,
for join types that depend on the inner-tuple match flags in the hash
table, we need to reset these match flags to avoid incorrect results.
This applies to right, right-anti, right-semi, and full joins.
When I introduced "Right Semi Join" plan shapes in aa86129e1, I failed
to reset the match flags in the hash table for right-semi joins in
rescans. This oversight has been shown to produce incorrect results.
This patch fixes it.
Author: Richard Guo
Discussion: https://postgr.es/m/CAMbWs4-nQF9io2WL2SkD0eXvfPdyBc9Q=hRwfQHCGV2usa0jyA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
| |
The error detail message "Replica identity consists of an unpublished
generated column." implies that the entire replica identity is made up of
an unpublished generated column which may not be the case.
Reported-by: Peter Smith
Author: Shlok Kyal
Reviewed-by: Peter Smith, Amit Kapila
Discussion: https://postgr.es/m/CAHut+PuwMhKx0PhOA4APhJTLoBGNykbeCQpr_CuwGT-SkswG5w@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When short-circuiting WindowAgg node evaluation on the top-level
WindowAgg node using quals on monotonic window functions, because the
WindowAgg run condition can mean there's no need to evaluate subsequent
window function results in the same partition once the run condition
becomes false, it was possible that the executor would use stale results
from the previous invocation of the window function in some cases.
A fix for this was partially done by a5832722, but that commit only
fixed the issue for non-top-level WindowAgg nodes. I mistakenly thought
that the top-level WindowAgg didn't have this issue, but Jayesh's example
case clearly shows that's incorrect. At the time, I also thought that
this only affected 32-bit systems as all window functions which then
supported run conditions returned BIGINT, however, that's wrong as
ExecProject is still called and that could cause evaluation of any other
window function belonging to the same WindowAgg node, one of which may
return a byref type.
The only queries affected by this are WindowAggs with a "Run Condition"
which contains at least one window function with a byref result type,
such as lead() or lag() on a byref column. The window clause must also
contain a PARTITION BY clause (without a PARTITION BY, execution of the
WindowAgg stops immediately when the run condition becomes false and
there's no risk of using the stale results).
Reported-by: Jayesh Dehankar
Discussion: https://postgr.es/m/193261e2c4d.3dd3cd7c1842.871636075166132237@zohocorp.com
Backpatch-through: 15, where WindowAgg run conditions were added
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Ensure stored generated columns that are part of REPLICA IDENTITY must be
published explicitly for UPDATE and DELETE operations to be published. We
can publish generated columns by listing them in the column list or by
enabling the publish_generated_columns option.
This commit changes the behavior of the test added in commit adedf54e65 by
giving an ERROR for the UPDATE operation in such cases. There is no way to
trigger the bug reported in commit adedf54e65 but we didn't remove the
corresponding code change because it is still relevant when replicating
changes from a publisher with version less than 18.
We decided not to backpatch this behavior change to avoid the risk of
breaking existing output plugins that may be sending generated columns by
default although we are not aware of any such plugin. Also, we didn't see
any reports related to this on STABLE branches which is another reason not
to backpatch this change.
Author: Shlok Kyal, Hou Zhijie
Reviewed-by: Vignesh C, Amit Kapila
Discussion: https://postgr.es/m/CANhcyEVw4V2Awe2AB6i0E5AJLNdASShGfdBLbUd1XtWDboymCA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit d28dff3f6cd6a7562fb2c211ac0fb74a33ffd032.
Quite a large number of buildfarm members didn't like this commit and
it's not yet clear why. Reverting this before too many animals turn
red.
Discussion: https://postgr.es/m/CAApHDvr9i6T5=iAwQCxFDgMsthr_obVxgwBaEJkC8KUH6yM3Hw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new compact_attrs array stores a few select fields from
FormData_pg_attribute in a more compact way, using only 16 bytes per
column instead of the 104 bytes that FormData_pg_attribute uses. Using
CompactAttribute allows performance-critical operations such as tuple
deformation to be performed without looking at the FormData_pg_attribute
element in TupleDesc which means fewer cacheline accesses. With this
change, NAMEDATALEN could be increased with a much smaller negative impact
on performance.
For some workloads, tuple deformation can be the most CPU intensive part
of processing the query. Some testing with 16 columns on a table
where the first column is variable length showed around a 10% increase in
transactions per second for an OLAP type query performing aggregation on
the 16th column. However, in certain cases, the increases were much
higher, up to ~25% on one AMD Zen4 machine.
This also makes pg_attribute.attcacheoff redundant. A follow-on commit
will remove it, thus shrinking the FormData_pg_attribute struct by 4
bytes.
Author: David Rowley
Discussion: https://postgr.es/m/CAApHDvrBztXP3yx=NKNmo3xwFAFhEdyPnvrDg3=M0RhDs+4vYw@mail.gmail.com
Reviewed-by: Andres Freund, Victor Yegorov
|
|
|
|
|
|
|
|
| |
Many of them just seem to have been copied around for no real reason.
Their presence causes (small) risks of hiding actual type mismatches
or silently discarding qualifiers
Discussion: https://www.postgresql.org/message-id/flat/461ea37c-8b58-43b4-9736-52884e862820@eisentraut.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If passed a read-write expanded object pointer, the EEOP_NULLIF
code would hand that same pointer to the equality function
and then (unless equality was reported) also return the same
pointer as its value. This is no good, because a function that
receives a read-write expanded object pointer is fully entitled
to scribble on or even delete the object, thus corrupting the
NULLIF output. (This problem is likely unobservable with the
equality functions provided in core Postgres, but it's easy to
demonstrate with one coded in plpgsql.)
To fix, make sure the pointer passed to the equality function
is read-only. We can still return the original read-write
pointer as the NULLIF result, allowing optimization of later
operations.
Per bug #18722 from Alexander Lakhin. This has been wrong
since we invented expanded objects, so back-patch to all
supported branches.
Discussion: https://postgr.es/m/18722-fd9e645448cc78b4@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a CTE, subquery, sublink, security invoker view, or coercion
projection references a table with row-level security policies, we
neglected to mark the plan as potentially dependent on which role
is executing it. This could lead to later executions in the same
session returning or hiding rows that should have been hidden or
returned instead.
Reported-by: Wolfgang Walther
Reviewed-by: Noah Misch
Security: CVE-2024-10976
Backpatch-through: 12
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Two attributes are added to pg_stat_database:
* parallel_workers_to_launch, counting the total number of parallel
workers that were planned to be launched.
* parallel_workers_launched, counting the total number of parallel
workers actually launched.
The ratio of both fields can provide hints that there are not enough
slots available when launching parallel workers, also useful when
pg_stat_statements is not deployed on an instance (i.e. cf54a2c00254).
This commit relies on de3a2ea3b264, that has added two fields to EState,
that get incremented when executing Gather or GatherMerge nodes.
A test is added in select_parallel, where parallel workers are spawned.
Bump catalog version.
Author: Benoit Lobréau
Discussion: https://postgr.es/m/783bc7f7-659a-42fa-99dd-ee0565644e25@dalibo.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit ac04aa84a put the shutoff for this into the planner, which is
not ideal because it doesn't prevent us from re-using a previously
made parallel plan. Revert the planner change and instead put the
shutoff into InitializeParallelDSM, modeling it on the existing code
there for recovering from failure to allocate a DSM segment.
However, that code path is mostly untested, and testing a bit harder
showed there's at least one bug: ExecHashJoinReInitializeDSM is not
prepared for us to have skipped doing parallel DSM setup. I also
thought the Assert in ReinitializeParallelWorkers is pretty
ill-advised, and replaced it with a silent Min() operation.
The existing test case added by ac04aa84a serves fine to test this
version of the fix, so no change needed there.
Patch by me, but thanks to Noah Misch for the core idea that we
could shut off worker creation when !INTERRUPTS_CAN_BE_PROCESSED.
Back-patch to v12, as ac04aa84a was.
Discussion: https://postgr.es/m/CAC-SaSzHUKT=vZJ8MPxYdC_URPfax+yoA1hKTcF4ROz_Q6z0_Q@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
adf97c156 gave ExprStates the ability to hash expressions and return a
single hash value. That commit supports seeding the hash value with an
initial value to have that blended into the final hash value.
Here we fix a hypothetical bug where if there are zero expressions to
hash, the initial value is stored in the wrong location. The existing
code stored the initial value in an intermediate location expecting that
when the expressions were hashed that those steps would store the final
hash value in the ExprState.resvalue field. However, that wouldn't happen
when there are zero expressions to hash. The correct thing to do instead
is to have a special case for zero expressions and when we hit that case,
store the initial value directly in the ExprState.resvalue. The reason
that this is a hypothetical bug is that no code currently calls
ExecBuildHash32Expr passing a non-zero initial value.
Discussion: https://postgr.es/m/CAApHDvpMAL_zxbMRr1LOex3O7Y7R7ZN2i8iUFLQhqQiJMAg3qw@mail.gmail.com
|
|
|
|
|
|
|
|
| |
A few comments contained duplicate "the" in sentences, fix by removing
one occurrence.
Author: Vignesh C <vignesh21@gmail.com>
Discussion: https://postgr.es/m/CALDaNm2aEEiPwGJmPdzBxROVvs8n75yCjKz4K1f1B2TdWpzxTA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bf6c614a2 did some conversion work to use ExprState instead of manually
calling equality functions to check if one set of values is not distinct
from another set. That patch removed many of the fields that became
redundant as a result of that change, but it forgot to remove
SubPlanState.tab_eq_funcs. Fix that.
In passing, fix the header comment for TupleHashEntryData to correctly
spell the field name it's talking about.
Author: Rafia Sabih <rafia.pghackers@gmail.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CA+FpmFeycdombFzrjZw7Rmc29CVm4OOzCWwu=dVBQ6q=PX8SvQ@mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvrWR2jYVhec=COyF2g2BE_ns91NDsCHAMFiXbyhEujKdQ@mail.gmail.com
|
|
|
|
|
|
|
|
| |
as determined by IWYU
These are mostly issues that are new since commit dbbca2cf299.
Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Move all responsibility for indicating a block is exhuasted into
table_scan_bitmap_next_tuple() and advance the main iterator in
heap-specific code. This flow control makes more sense and is a step
toward using the read stream API for bitmap heap scans.
Previously, table_scan_bitmap_next_block() returned false to indicate
table_scan_bitmap_next_tuple() should not be called for the tuples on
the page. This happened both when 1) there were no visible tuples on the
page and 2) when the block returned by the iterator was past the end of
the table. BitmapHeapNext() (generic bitmap table scan code) handled the
case when the bitmap was exhausted.
It makes more sense for table_scan_bitmap_next_tuple() to return false
when there are no visible tuples on the page and
table_scan_bitmap_next_block() to return false when the bitmap is
exhausted or there are no more blocks in the table.
As part of this new design, TBMIterateResults are no longer used as a
flow control mechanism in BitmapHeapNext(), so we removed
table_scan_bitmap_next_tuple's TBMIterateResult parameter.
Note that the prefetch iterator is still saved in the
BitmapHeapScanState node and advanced in generic bitmap table scan code.
This is because 1) it was not necessary to change the prefetch iterator
location to change the flow control in BitmapHeapNext() 2) modifying
prefetch iterator management requires several more steps better split
over multiple commits and 3) the prefetch iterator will be removed once
the read stream API is used.
Author: Melanie Plageman
Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas, Mark Dilger
Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
|