| Commit message | Author | Age |
| |
Attempting to use commit timestamps during bootstrapping leads to an
assertion failure, which can be reached for example with an initdb -c
that enables track_commit_timestamp. It makes little sense to register
a commit timestamp for a BootstrapTransactionId, so let's disable the
activation of the module in this case.
This problem has been independently reported once by each author of this
commit. Each author has proposed basically the same patch, relying on
IsBootstrapProcessingMode() to skip the use of commit_ts during
bootstrap. The test addition is a suggestion by me, and is applied down
to v16.
Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
Author: Andy Fan <zhihuifan1213@163.com>
Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/OSCPR01MB14966FF9E4C4145F37B937E52F5102@OSCPR01MB14966.jpnprd01.prod.outlook.com
Discussion: https://postgr.es/m/87plejmnpy.fsf@163.com
Backpatch-through: 13
| |
Currently check_recovery_target_timeline() converts any value that is
not "current", "latest", or a valid integer to 0. So, for example, the
following configuration added to postgresql.conf followed by a startup:
    recovery_target_timeline = 'bogus'
    recovery_target_timeline = '9999999999'
... results in the following error patterns:
    FATAL: 22023: recovery target timeline 0 does not exist
    FATAL: 22023: recovery target timeline 1410065407 does not exist
This is confusing, because the server does not reflect the intention of
the user, and just reports incorrect data unrelated to the GUC.
The origin of the problem is that we do not perform a range check on the
GUC value passed in for recovery_target_timeline. This commit improves
the situation by using strtou64() and by providing stricter range
checks. Some test cases are added for the cases of an incorrect, an
upper-bound and a lower-bound timeline value, checking the sanity of the
reports based on the contents of the server logs.
Author: David Steele <david@pgmasters.net>
Discussion: https://postgr.es/m/e5d472c7-e9be-4710-8dc4-ebe721b62cea@pgbackrest.org
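
The range check described above can be sketched in plain C. This is only
an illustration, not the PostgreSQL code: parse_timeline_id() is a
hypothetical helper, standard strtoull() stands in for strtou64(), and
the accepted range of 1..UINT32_MAX is an assumption about what the
"stricter range checks" allow. The second error message in the example
is consistent with a 64-bit value being truncated to 32 bits, since
9999999999 modulo 2^32 is 1410065407.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not the PostgreSQL function: parse a numeric
 * timeline ID, rejecting garbage and values outside 1..UINT32_MAX.
 */
static bool
parse_timeline_id(const char *value, uint32_t *tli)
{
    char       *end;
    unsigned long long parsed;

    /* require a leading digit, so signs and whitespace are rejected */
    if (value[0] < '0' || value[0] > '9')
        return false;

    errno = 0;
    parsed = strtoull(value, &end, 10);

    /* trailing junk, or a value too large even for the 64-bit parse */
    if (*end != '\0' || errno == ERANGE)
        return false;

    /* assumed valid range: timeline IDs are 32-bit and start at 1 */
    if (parsed < 1 || parsed > UINT32_MAX)
        return false;

    *tli = (uint32_t) parsed;
    return true;
}
```

With a check of this shape, both 'bogus' and '9999999999' are rejected
when the setting is read, rather than turning into an unrelated timeline
number in a later error message.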
| |
Restore nbtree preprocessing comments describing how we mark nbtree row
compare members required to how they were prior to 2016 bugfix commit
a298a1e0.
Oversight in commit bd3f59fd, which made nbtree preprocessing revert to
the original 2006 rules, but neglected to revert these comments.
Backpatch-through: 18
| |
Recent nbtree bugfix commit 5f4d98d4 added a special case to the code
that sets up a page-level prefix of keys that are definitely satisfied
by every tuple on the page: whenever _bt_set_startikey reached a row
compare key, we'd refuse to apply the pstate.forcenonrequired behavior
in scans where that usually happens (scans with a higher-order array
key). That hack made the scan avoid essentially the same infinite
cycling behavior that also affected nbtree scans with redundant keys
(keys that preprocessing could not eliminate) prior to commit f09816a0.
There are now serious doubts about this row compare workaround.
Testing has shown that a scan with a row compare key and an array key
could still read the same leaf page twice (without the scan's direction
changing), which isn't supposed to be possible following the SAOP
enhancements added by Postgres 17 commit 5bf748b8. Also, we still
allowed a required row compare key to be used with forcenonrequired mode
when its header key happened to be beyond the pstate.ikey set by
_bt_set_startikey, which was complicated and brittle.
The underlying problem was that row compares had inconsistent rules
around how scans start (which keys can be used for initial positioning
purposes) and how scans end (which keys can set continuescan=false).
Quals with redundant keys that could not be eliminated by preprocessing
also had that same quality to them prior to today's bugfix f09816a0. It
now seems prudent to bring row compare keys in line with the new charter
for required keys, by making the start and end rules symmetric.
This commit fixes two points of disagreement between _bt_first and
_bt_check_rowcompare. Firstly, _bt_check_rowcompare was capable of
ending the scan at the point where it needed to compare an ISNULL-marked
row compare member that came immediately after a required row compare
member. _bt_first now has symmetric handling for NULL row compares.
Secondly, _bt_first had its own ideas about which keys were safe to use
for initial positioning purposes. It could use fewer or more keys than
_bt_check_rowcompare. _bt_first now uses the same requiredness markings
as _bt_check_rowcompare for this.
Now that _bt_first and _bt_check_rowcompare agree on how to start and
end scans, we can get rid of the forcenonrequired special case, without
any risk of infinite cycling. This approach also makes row compare keys
behave more like regular scalar keys, particularly within _bt_first.
Fixing these inconsistencies necessitates dealing with a related issue
with the way that row compares were marked required by preprocessing: we
didn't mark any lower-order row members required following 2016 bugfix
commit a298a1e0. That approach was overly broad. The bug in question was
actually an oversight in how _bt_check_rowcompare dealt with tuple NULL
values that failed to satisfy a scan key marked required in the opposite
scan direction (it was a bug in 2011 commits 6980f817 and 882368e8, not
a bug in 2006 commit 3a0a16cb). Go back to marking row compare members
as required using the original 2006 rules, and fix the 2016 bug in a
more principled way: by limiting use of the "set continuescan=false with
a key required in the opposite scan direction upon encountering a NULL
tuple value" optimization to the first/most significant row member key.
While it isn't safe to use an implied IS NOT NULL qualifier to end the
scan when it comes from a required lower-order row compare member key,
it _is_ generally safe for such a required member key to end the scan --
provided the key is marked required in the _current_ scan direction.
This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275. It is a direct follow-up to today's commit f09816a0.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=pcijHL_mA0_TJ5LiTB28QpQ0cGtT-ccFV=KzuunNDDQ@mail.gmail.com
Backpatch-through: 18
| |
nbtree preprocessing's handling of redundant (and contradictory) keys
created problems for scans with = arrays. It was just about possible
for a scan with an = array key and one or more redundant keys (keys that
preprocessing could not eliminate due to an incomplete opfamily and a
cross-type key) to get stuck. Testing has shown that infinite cycling
where the scan never manages to make forward progress was possible.
This could happen when the scan's arrays were reset in _bt_readpage's
forcenonrequired=true path (added by bugfix commit 5f4d98d4) when the
arrays weren't at least advanced up to the same point that they were in
at the start of the _bt_readpage call. Earlier redundant keys prevented
the finaltup call to _bt_advance_array_keys from reaching lower-order
keys that needed to be used to sufficiently advance the scan's arrays.
To fix, make preprocessing leave the scan's keys in a state that is as
close as possible to how it'll usually leave them (in the common case
where there are no redundant keys that preprocessing failed to eliminate).
Now nbtree preprocessing _reliably_ leaves behind at most one required
>/>= key per index column, and at most one required </<= key per index
column. Columns that have one or more = keys that are eligible to be
marked required (based on the traditional rules) prioritize the = keys
over redundant inequality keys; they'll _reliably_ be left with only one
of the = keys as the index column's only required key.
Keys that are not marked required (whether due to the new preprocessing
step running or for some other reason) are relocated to the end of the
so->keyData[] array as needed. That way they'll always be evaluated
after the scan's required keys, and so cannot prevent code in places
like _bt_advance_array_keys and _bt_first from reaching a required key.
Also teach _bt_first to decide which initial positioning keys to use
based on the same requiredness markings that have long been used by
_bt_checkkeys/_bt_advance_array_keys. This is a necessary condition for
reliably avoiding infinite cycling. _bt_advance_array_keys expects to
be able to reason about what'll happen in the next _bt_first call should
it start another primitive index scan, by evaluating inequality keys
that were marked required in the opposite-to-scan scan direction only.
Now everybody (_bt_first, _bt_checkkeys, and _bt_advance_array_keys)
will always agree on which exact key will be used on each index column
to start and/or end the scan (except when row compare keys are involved,
which have similar problems not addressed by this commit).
An upcoming commit will finish off the work started by this commit by
harmonizing how _bt_first, _bt_checkkeys, and _bt_advance_array_keys
apply row compare keys to start and end scans.
This fixes what was arguably an oversight in either commit 5f4d98d4 or
commit 8a510275.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Discussion: https://postgr.es/m/CAH2-Wz=ds4M+3NXMgwxYxqU8MULaLf696_v5g=9WNmWL2=Uo2A@mail.gmail.com
Backpatch-through: 18
| |
Commit c120550edb86 optimized the vacuuming of relations without
indexes (a.k.a. one-pass strategy) by directly marking dead item IDs
as LP_UNUSED. However, the periodic FSM vacuum was still checking if
dead item IDs had been marked as LP_DEAD when attempting to vacuum the
FSM every VACUUM_FSM_EVERY_PAGES blocks. This condition was never met
due to the optimization, resulting in missed FSM vacuum
opportunities.
This commit modifies the periodic FSM vacuum condition to use the
number of tuples deleted during HOT pruning. This count includes items
marked as either LP_UNUSED or LP_REDIRECT, both of which are expected
to result in new free space to report.
Back-patch to v17 where the vacuum optimization for tables with no
indexes was introduced.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAD21AoBL8m6B9GSzQfYxVaEgvD7-Kr3AJaS-hJPHC+avm-29zw@mail.gmail.com
Backpatch-through: 17
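
The change in the periodic FSM vacuum condition can be sketched as
follows. The function names, parameter names, and the value of
FSM_EVERY_PAGES are hypothetical stand-ins chosen for this illustration;
only the idea (test the pruning delete count instead of the presence of
LP_DEAD items) comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for VACUUM_FSM_EVERY_PAGES; the value is illustrative. */
#define FSM_EVERY_PAGES 1024u

/*
 * Old-style test: keyed on LP_DEAD items, which the one-pass strategy
 * never leaves behind because it marks them LP_UNUSED directly.
 */
static bool
fsm_vacuum_due_old(bool has_lpdead_items, uint32_t blkno,
                   uint32_t next_fsm_block)
{
    return has_lpdead_items && blkno - next_fsm_block >= FSM_EVERY_PAGES;
}

/* New-style test: keyed on the number of tuples deleted during pruning. */
static bool
fsm_vacuum_due_new(int ndeleted, uint32_t blkno, uint32_t next_fsm_block)
{
    return ndeleted > 0 && blkno - next_fsm_block >= FSM_EVERY_PAGES;
}
```

For a table without indexes the first argument of the old test is always
false, so it never fires no matter how many blocks have been processed.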
| |
This commit refactors the vacuum routines that rely on VacuumParams,
adding const markers where necessary to enforce a new policy in the code.
This structure should not use a pointer as it may be used across
multiple relations, and its contents should never be updated.
vacuum_rel() stands as an exception as it touches the "index_cleanup"
and "truncate" options.
VacuumParams was introduced in 0d831389749a, and 661643dedad9
fixed a bug impacting VACUUM operating on multiple relations. The
changes done in tableam.h break ABI compatibility, so this commit can
only happen on HEAD.
Author: Shihao Zhong <zhong950419@gmail.com>
Co-authored-by: Michael Paquier <michael@paquier.xyz>
Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
Reviewed-by: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAGRkXqTo+aK=GTy5pSc-9cy8H2F2TJvcrZ-zXEiNJj93np1UUw@mail.gmail.com
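
The policy can be illustrated with a small sketch. DemoVacuumParams and
both functions are hypothetical, trimmed-down stand-ins, not the real
declarations: the point is only that a struct passed by value (and marked
const where it is read-only) cannot carry one relation's adjustments over
to the next relation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for VacuumParams. */
typedef struct DemoVacuumParams
{
    bool        index_cleanup;
    bool        truncate;
} DemoVacuumParams;

/*
 * Takes the struct by value: per-relation adjustments are made to a
 * private copy and cannot leak into the next relation's run.
 */
static bool
demo_vacuum_rel(DemoVacuumParams params, bool rel_disables_truncate)
{
    if (rel_disables_truncate)
        params.truncate = false;
    return params.truncate;
}

/* Read-only consumer: the const marker makes any write a compile error. */
static bool
demo_wants_index_cleanup(const DemoVacuumParams params)
{
    return params.index_cleanup;
}
```

Calling demo_vacuum_rel() for a relation that disables truncation leaves
the caller's struct untouched, so a second relation processed by the same
command still sees the original options.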
| |
8e03eb92e9a reverted commit 39b66a91bd, which allowed freezing
in the heap_insert() code path but forgot to remove the corresponding
check in heap_xlog_insert(). This code is extraneous but not harmful.
However, cleaning it up makes it very clear that, as of now, we do not
support any freezing of pages in the heap_insert() path.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/flat/CAAKRu_Zp4Pi-t51OFWm1YZ-cctDfBhHCMZ%3DEx6PKxv0o8y2GvA%40mail.gmail.com
Backpatch-through: 14
| |
We can simplify the VM counters added in dc6acfd910b8 to
lazy_vacuum_heap_page() and lazy_scan_new_or_empty().
We won't invoke lazy_vacuum_heap_page() unless there are dead line
pointers, so we know the page can't be all-visible.
In lazy_scan_new_or_empty(), we only update the VM if the page-level
hint PD_ALL_VISIBLE is clear, and the VM bit cannot be set if the page
level bit is clear because a subsequent page update would fail to clear
the visibility map bit.
Simplify the logic for determining which log counters to increment based
on this knowledge. Doing so is worthwhile because the old logic was
confusing and misguided.
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/flat/CAAKRu_a9w_n2mwY%3DG4LjfWTvRTJtjbfvnYAKi4WjO8QXHHrA0g%40mail.gmail.com
| |
Sometimes the TupleDesc used in verify_compact_attribute() is shared
among backends, and since CompactAttribute.attcacheoff gets updated
during tuple deformation, it was possible that another backend would
set attcacheoff on a given CompactAttribute in the small window of time
from when the attcacheoff from the live CompactAttribute was being set
in the 'tmp' CompactAttribute and before the Assert verifying that the
live and tmp CompactAttributes matched.
Here we adjust the code to make a copy of the live CompactAttribute so
that we're not trying to Assert against a shared copy of it.
Author: David Rowley <dgrowleyml@gmail.com>
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/7195e408-758c-4031-8e61-4f842c716ac0@gmail.com
| |
The new tests verify that logical and physical replication slots are still
valid after an immediate restart on checkpoint completion when the slot was
advanced during the checkpoint.
This commit introduces two new injection points to make these tests possible:
* checkpoint-before-old-wal-removal - triggered in the checkpointer process
just before old WAL segments cleanup;
* logical-replication-slot-advance-segment - triggered in
LogicalConfirmReceivedLocation() when restart_lsn was changed enough to
point to the next WAL segment.
Discussion: https://postgr.es/m/flat/1d12d2-67235980-35-19a406a0%4063439497
Author: Vitaly Davydov <v.davydov@postgrespro.ru>
Author: Tomas Vondra <tomas@vondra.me>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Amit Kapila <amit.kapila16@gmail.com>
Backpatch-through: 17
| |
_bt_readnextpage expects so->currPos.buf to be InvalidBuffer (and for
the position's page to be unlocked) when called. However, it does not
expect there to be no pins held on any page. In particular, so->markPos
might hold a separate pin, both before and after the call. Fix some
comments that seemed to suggest otherwise.
Follow-up commit to commit 7c319f54, which made _bt_killitems drop pins
it acquired itself.
| |
This reverts commit 54c6ea8c81db718508eeea50991d3c1c5dff54a5.
Further analysis has shown that the forcenonrequired row compare
behavior is in fact necessary, despite the new restrictions on
RowCompares imposed by _bt_set_startikey following commit 5f4d98d4.
Discussion: https://postgr.es/m/CAH2-Wzm3bKcz3TbHGem3_+SinEyG=VZVPbApQghp7YiZj+MM3g@mail.gmail.com
| |
Teach nbtree's _bt_killitems to leave the so->currPos page that it sets
LP_DEAD items on in whatever state it was in when _bt_killitems was
called. In particular, make sure that so->dropPin scans don't acquire a
pin whose reference is saved in so->currPos.buf.
Allowing _bt_killitems to change so->currPos.buf like this is wrong.
The immediate consequence of allowing it is that code in _bt_steppage
(that copies so->currPos into so->markPos) will behave as if the scan is
a !so->dropPin scan. so->markPos will therefore retain the buffer pin
indefinitely, even though _bt_killitems only needs to acquire a pin
(along with a lock) for long enough to mark known-dead items LP_DEAD.
This issue came to light following a report of a failure of an assertion
from recent commit e6eed40e. The test case in question involves the use
of mark and restore. An initial call to _bt_killitems takes place that
leaves so->currPos.buf in a state that is inconsistent with the scan
being so->dropPin. A subsequent call to _bt_killitems for the same
position (following so->currPos being saved in so->markPos, and then
restored as so->currPos) resulted in the failure of an assertion that
tests that so->currPos.buf is InvalidBuffer when the scan is so->dropPin
(non-assert builds got a "resource was not closed" WARNING instead).
The same problem exists on earlier releases, though the issue is far
more subtle there. Recent commit e6eed40e introduced the so->dropPin
field as a partial replacement for testing so->currPos.buf directly.
Earlier releases won't get an assertion failure (or buffer pin leak),
but they will allow the second _bt_killitems call from the test case to
behave as if a buffer pin was consistently held since the original call
to _bt_readpage. This is wrong; there will have been an initial window
during which no pin was held on the so->currPos page, and yet the second
_bt_killitems call will neglect to check if so->currPos.lsn continues to
match the page's now-current LSN.
As a result of all this, it's just about possible that _bt_killitems
will set the wrong items LP_DEAD (on release branches). This could only
happen with merge joins (the sole user of nbtree mark/restore support),
when a concurrently inserted index tuple used a recently-recycled TID
(and only when the new tuple was inserted onto the same page as a
distinct concurrently-removed tuple with the same TID). This is exactly
the scenario that _bt_killitems' check of the page's now-current LSN
against the LSN stashed in currPos was supposed to prevent.
A follow-up commit will make nbtree completely stop conditioning whether
or not a position's pin needs to be dropped on whether the 'buf' field
is set. All call sites that might need to drop a still-held pin will be
taught to rely on the scan-level so->dropPin field recently introduced
by commit e6eed40e. That will make bugs of the same general nature as
this one impossible (or make them much easier to detect, at least).
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/545be1e5-3786-439a-9257-a90d30f8b849@gmail.com
Backpatch-through: 13
| |
We concluded that commit e5a3c9d9b is a feature rather than a fix; since
it was added after feature freeze, revert it.
Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Reported-by: Michael Paquier <michael@paquier.xyz>
Reported-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/ed2296f1-1a6b-4932-b870-5bb18c2591ae%40oss.nttdata.com
| |
Delay calling BufferGetLSNAtomic() until we finish reading a page that
actually contains items that btgettuple will return to the executor.
This reduces the number of calls during plain index scans (we'll only
call BufferGetLSNAtomic() when _bt_readpage returns true), and totally
eliminates calls during index-only scans, bitmap index scans, and plain
index scans of an unlogged relation.
Currently, when checksums (or wal_log_hints) are enabled, acquiring a
page's LSN in BufferGetLSNAtomic() involves locking the buffer header
(which involves the use of spinlocks). Testing has shown that enabling
page-level checksums causes large regressions with certain workloads,
especially on larger multi-socket systems.
The regression isn't tied to any Postgres 18 commit. However, Postgres
18 commit 04bec894 made initdb use checksums by default, so it seems
prudent to address the problem now.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/941f0190-e3c6-4622-9ac7-c04e936e5fdb@vondra.me
Discussion: https://postgr.es/m/CAH2-Wzk-Dg5XWs_jDuiHt4_7ryrSY+n=vxmHY51EVqPDFsKXmg@mail.gmail.com
| |
Use of a RowCompare key makes nbtree index scans ineligible to use
pstate.forcenonrequired following recent bugfix commit 5f4d98d4.
There's no longer any need for _bt_check_rowcompare to accept a
forcenonrequired argument, so remove it.
| |
This commit renames the GUC log_lock_failure to log_lock_failures
to align with the existing similar setting log_lock_waits, which uses
the plural form. This improves naming consistency across related GUCs.
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Author: Fujii Masao <masao.fujii@gmail.com>
Reviewed-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://postgr.es/m/7a8198b6-d5b8-4910-b41e-8d3efcbb015d@eisentraut.org
| |
We store values for these options as array elements with the syntax
"name=value", hence a name containing "=" confuses matters when
it's time to read the array back in. Since validation of the
options is often done (long) after this conversion to array format,
that leads to confusing and off-point error messages. We can
improve matters by rejecting names containing "=" up-front.
(Probably a better design would have involved pairs of array
elements, but it's too late now --- and anyway, there's no
evident use-case for option names like this. We already
reject such names in some other contexts such as GUCs.)
Reported-by: Chapman Flack <jcflack@acm.org>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Reviewed-by: Chapman Flack <jcflack@acm.org>
Discussion: https://postgr.es/m/6830EB30.8090904@acm.org
Backpatch-through: 13
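
A short sketch of why such names are ambiguous. format_option() and
option_name_is_valid() are hypothetical helpers written for this
illustration, not the functions touched by the commit: two different
name/value pairs produce the same "name=value" element, so the array
cannot be read back reliably unless "=" is rejected in names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the "name=value" array element text. */
static void
format_option(char *dst, size_t dstlen, const char *name, const char *value)
{
    snprintf(dst, dstlen, "%s=%s", name, value);
}

/* Hypothetical up-front check: a name containing "=" cannot round-trip. */
static bool
option_name_is_valid(const char *name)
{
    return strchr(name, '=') == NULL;
}
```

Here the pairs ("a=b", "c") and ("a", "b=c") both serialize to "a=b=c",
which is why the check has to happen before the conversion to array
format.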
| |
052026c9b9 mistakenly reordered setup steps in heap_vacuum_rel(),
incorrectly moving RelationGetNumberOfBlocks() before
vacuum_get_cutoffs().
OldestXmin must be determined before RelationGetNumberOfBlocks()
calculates the number of blocks in the relation that will be vacuumed.
Otherwise tuples older than OldestXmin may be inserted into the end of
the relation into blocks that are not vacuumed. If additional tuples
newer than those inserted into unscanned blocks but older than
OldestXmin are inserted into free space earlier in the relation, the
result could be advancing pg_class.relfrozenxid to a newer value than an
unfrozen XID in one of the unscanned heap pages.
Assigning an incorrect relfrozenxid can lead to data loss, so it is
imperative that it correctly reflect the oldest unfrozen xid.
Reported-by: Peter Geoghegan <pg@bowt.ie>
Author: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzntqvVEdbbpqG5JqSZGuLWmy4PBfUO-OswfivKchr2gvw%40mail.gmail.com
| |
Commit 7406ab623fe added a gist support function that we internally
refer to by the symbol GIST_STRATNUM_PROC. This translated from
"well-known" strategy numbers to opfamily-specific strategy numbers.
However, we later (commit 630f9a43cec) changed this to fit into
index-AM-level compare type mapping, so this function actually now
maps from compare type to opfamily-specific strategy numbers. So this
name is no longer fitting.
Moreover, the index AM level also supports the opposite, a function to
map from strategy number to compare type. This is currently not
supported in gist, but one might wonder what this function is supposed
to be called when it is added.
This patch changes the naming of the gist-level functionality to be
more in line with the index-AM-level functionality. This makes sense
because these are essentially the same thing on different levels.
This also changes the names of the externally visible functions that
are provided for use as such a support function.
Reviewed-by: Paul A Jungwirth <pj@illuminatedcomputing.com>
Discussion: https://www.postgresql.org/message-id/37ebb1d9-9036-485f-a215-e55435689917%40eisentraut.org
| |
Previously, postgres_fdw always 1) opened a remote transaction in READ
WRITE mode even when the local transaction was READ ONLY, so that a READ
ONLY transaction referencing a foreign table mapped to a remote view
that executes a volatile function could write on the remote side, and
2) opened the remote transaction in NOT DEFERRABLE mode even when the
local transaction was DEFERRABLE, so that a SERIALIZABLE READ ONLY
DEFERRABLE transaction could abort due to a serialization failure on
the remote side.
To avoid these, modify postgres_fdw to open a remote transaction in the
same access/deferrable modes as the local transaction. This commit also
modifies it to open a remote subtransaction in the same access mode as
the local subtransaction.
Although these issues have existed since the introduction of postgres_fdw,
there have been no reports from the field. So it seems fine to just fix
them in master only.
Author: Etsuro Fujita <etsuro.fujita@gmail.com>
Reviewed-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAPmGK16n_hcUUWuOdmeUS%2Bw4Q6dZvTEDHb%3DOP%3D5JBzo-M3QmpQ%40mail.gmail.com
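
The idea of mirroring the local modes can be sketched as a function that
composes the statement opening the remote transaction.
build_remote_begin() and its parameters are hypothetical, and the exact
SQL text that postgres_fdw sends is an assumption of this sketch; it only
shows the access and deferrable modes following the local transaction
instead of being left at their defaults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: compose the statement that opens the remote
 * transaction so that it mirrors the local access and deferrable modes.
 */
static void
build_remote_begin(char *dst, size_t dstlen,
                   bool local_serializable,
                   bool local_read_only,
                   bool local_deferrable)
{
    snprintf(dst, dstlen, "START TRANSACTION ISOLATION LEVEL %s%s%s",
             local_serializable ? "SERIALIZABLE" : "REPEATABLE READ",
             local_read_only ? " READ ONLY" : "",
             local_deferrable ? " DEFERRABLE" : "");
}
```

With all three flags set, the sketch yields a statement ending in
"SERIALIZABLE READ ONLY DEFERRABLE"; with none set it falls back to a
plain REPEATABLE READ transaction in the default READ WRITE mode.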
| |
A few places that access system catalogs don't set up an active
snapshot before potentially accessing their TOAST tables. To fix,
push an active snapshot just before each section of code that might
require accessing one of these TOAST tables, and pop it shortly
afterwards. While at it, this commit adds some rather strict
assertions in an attempt to prevent such issues in the future.
Commit 16bf24e0e4 recently removed pg_replication_origin's TOAST
table in order to fix the same problem for that catalog. On the
back-branches, those bugs are left in place. We cannot easily
remove a catalog's TOAST table on released major versions, and only
replication origins with extremely long names are affected. Given
the low severity of the issue, fixing older versions doesn't seem
worth the trouble of significantly modifying the patch.
Also, on v13 and v14, the aforementioned strict assertions have
been omitted because commit 2776922201, which added
HaveRegisteredOrActiveSnapshot(), was not back-patched. While we
could probably back-patch it now, I've opted against it because it
seems unlikely that new TOAST snapshot issues will be introduced in
the oldest supported versions.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/18127-fe54b6a667f29658%40postgresql.org
Discussion: https://postgr.es/m/18309-c0bf914950c46692%40postgresql.org
Discussion: https://postgr.es/m/ZvMSUPOqUU-VNADN%40nathan
Backpatch-through: 13
| |
uint64 was perhaps chosen in cff440d36 as the type was uint32 prior to
that widening work.
Having this as uint64 doesn't make much sense and just adds the overhead of
having to remember that we always output this in its signed form. Let's
remove that overhead.
The signed form output is seemingly required since we have no way to
represent the full range of uint64 in an SQL type. We use BIGINT in places
like pg_stat_statements, which maps directly to int64.
The release notes "Source Code" section may want to mention this
adjustment as some extensions may wish to adjust their code.
Author: David Rowley <dgrowleyml@gmail.com>
Suggested-by: Peter Eisentraut <peter@eisentraut.org>
Reviewed-by: Sami Imseih <samimseih@gmail.com>
Reviewed-by: Michael Paquier <michael@paquier.xyz>
Discussion: https://postgr.es/m/50cb0c8b-994b-48f9-a1c4-13039eb3536b@eisentraut.org
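
The overhead mentioned above comes from the same 64 bits reading
differently as unsigned and as signed. A minimal sketch, where
as_sql_bigint() is a hypothetical helper and not part of the commit:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Reinterpret the 64 bits of an unsigned value as signed, which is how
 * the value appears once it is shown through a BIGINT column.
 */
static int64_t
as_sql_bigint(uint64_t id)
{
    int64_t     result;

    memcpy(&result, &id, sizeof(result));
    return result;
}
```

Any value with the top bit set shows up as a negative number in SQL, so
code that keeps the value as uint64 internally has to remember to convert
before printing. Storing it as int64 throughout removes that step.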
| |
Previously, we asserted that the eager scan's success and failure
counters were positive before decrementing them. However, this
assumption was incorrect, as it's possible that some blocks have
already been eagerly scanned by the time eager scanning is disabled.
This commit replaces the assertions with guards to handle this
scenario gracefully.
With this change, we continue to allow read-ahead operations by the
read stream that exceed the success and failure caps. While there is a
possibility that overruns will trigger eager scans of additional
pages, this does not pose a practical concern as the overruns will not
be substantial and remain within an acceptable range.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAD21AoConf6tkVCv-=JhQJj56kYsDwo4jG5+WqgT+ukSkYomSQ@mail.gmail.com
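
Replacing an assertion with a guard can be sketched like this. The struct
and field names are hypothetical and chosen for the illustration only;
the real counters live in vacuum's per-relation state.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counters for the remaining eager scan budget. */
typedef struct DemoEagerScan
{
    uint32_t    remaining_successes;
    uint32_t    remaining_fails;
} DemoEagerScan;

/* Decrement only while positive, instead of asserting positivity. */
static void
demo_note_eager_success(DemoEagerScan *es)
{
    if (es->remaining_successes > 0)
        es->remaining_successes--;
}

/* Same guard for the failure counter. */
static void
demo_note_eager_failure(DemoEagerScan *es)
{
    if (es->remaining_fails > 0)
        es->remaining_fails--;
}
```

A block that was queued for eager scanning before the cap was reached can
then be accounted for after the counter has already hit zero, without
tripping an assertion or wrapping the unsigned counter around.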
| |
Since commit 8b1dccd37c7, the PREPARE TRANSACTION WAL record includes
information about dropped statistics entries. However, the WAL resource
manager description function for PREPARE TRANSACTION record failed to
parse this information correctly and always assumed there were
no such entries.
As a result, for example, pg_waldump could not display the dropped
statistics entries stored in PREPARE TRANSACTION records.
The root cause was that ParsePrepareRecord() did not set the number of
statistics entries to drop on commit or abort. These values remained
zero-initialized and were never updated from the parsed record.
This commit fixes the issue by properly setting those values during parsing.
With this fix, pg_waldump can now correctly report dropped statistics
entries in PREPARE TRANSACTION records.
Back-patch to v15, where commit 8b1dccd37c7 was introduced.
Author: Daniil Davydov <3danissimo@gmail.com>
Reviewed-by: Fujii Masao <masao.fujii@gmail.com>
Discussion: https://postgr.es/m/CAJDiXgh-6Epb2XiJe4uL0zF-cf0_s_7Lw1TfEHDMLzYjEmfGOw@mail.gmail.com
Backpatch-through: 15
| |
The macros INJECTION_POINT() and INJECTION_POINT_CACHED() are extended
with an optional argument that can be passed down to the callback
attached when an injection point is run, giving callbacks the
possibility to manipulate a stack state given by the caller. The
existing callbacks in modules injection_points and test_aio have their
declarations adjusted based on that.
da7226993fd4 (core AIO infrastructure) and 93bc3d75d8e1 (test_aio) have
been relying on a set of workarounds where a static variable called
pgaio_inj_cur_handle is used as runtime argument in the injection point
callbacks used by the AIO tests, in combination with a TRY/CATCH block
to reset the argument value. The infrastructure introduced in this
commit will be reused for the AIO tests, simplifying them.
Reviewed-by: Greg Burd <greg@burd.me>
Discussion: https://postgr.es/m/Z_y9TtnXubvYAApS@paquier.xyz
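
A self-contained mock of the idea, for illustration only. Every name here
(DEMO_INJECTION_POINT, demo_attach, demo_run, demo_set_flag) is
hypothetical and the real macro and callback signatures may differ; the
sketch just shows a caller-supplied argument reaching the callback, which
removes the need for a static variable to pass runtime state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: point name plus a caller-supplied argument. */
typedef void (*demo_inj_callback) (const char *name, void *arg);

typedef struct DemoInjectionPoint
{
    const char *name;
    demo_inj_callback callback;
} DemoInjectionPoint;

/* A single registered point keeps the sketch small. */
static DemoInjectionPoint demo_point = {NULL, NULL};

static void
demo_attach(const char *name, demo_inj_callback callback)
{
    demo_point.name = name;
    demo_point.callback = callback;
}

/* Run the point if attached, handing the caller's state to the callback. */
static void
demo_run(const char *name, void *arg)
{
    if (demo_point.callback != NULL && strcmp(demo_point.name, name) == 0)
        demo_point.callback(name, arg);
}

#define DEMO_INJECTION_POINT(name, arg) demo_run((name), (arg))

/* Example callback: sets a flag that lives on the caller's stack. */
static void
demo_set_flag(const char *name, void *arg)
{
    (void) name;
    *(int *) arg = 1;
}
```

A caller would declare a local int, attach demo_set_flag to a point name,
and pass the address of the local through DEMO_INJECTION_POINT; the
callback then modifies that local directly.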
| |
nbtree array index scans could fail to return matching tuples in rare
cases where the missed tuples cover key space that the scan's arrays
incorrectly indicate has already been read. These cases involved nearby
tuples with NULL values that were evaluated using a skip array key while
in pstate.forcenonrequired mode.
To fix, prevent forcenonrequired mode from prematurely advancing the
scan's array keys beyond key space that the scan has yet to read tuples
from: reset the scan's array keys (to the first elements in the current
scan direction) before the _bt_checkkeys call for pstate.finaltup. That
way _bt_checkkeys starts from a clean slate, which ensures that it will
call _bt_advance_array_keys (while passing it sktrig_required=true).
This reliably restores the invariant that the scan's arrays always
accurately track its progress through the index's key space (at least
when the scan is "between pages").
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Be more conservative when performing a scheduled recheck of an nbtree
scan's array keys once on the next page, having set so->scanBehind: back
out of reading the page (perform another primitive scan instead) when
the next page's high key/finaltup has an untruncated prefix of matching
values and truncated suffix attributes associated with lower-order keys.
In other words, stop assuming that the lower-order keys have been
satisfied by the truncated suffix attributes in this context (only do so
when considering scheduling a recheck within _bt_advance_array_keys).
The new behavior is more logical: if the next page read after setting
so->scanBehind can only contain tuples that are themselves "behind the
scan", that's reason enough to cut our losses. In general, when we set
so->scanBehind, we only expect to perform one recheck on the next page
to make a final decision about whether or not to continue the current
primitive index scan. It seems unprincipled for the recheck to allow a
_bt_readpage to continue unless the scan's arrays will advance/unless
the page might actually contain relevant tuples.
In practice it is highly unlikely that things will line up like this
(the untruncated prefix of attribute values from the next page's high
key is seldom an exact match for their corresponding array's current
element following array advancement on the original/previous page).
That gives us all the more reason to keep things simple and consistent.
This was arguably an oversight in commit 9a2e2a285a, which improved
nbtree array primitive scan scheduling.
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAH2-WzkXzJajgyW-pCQ7vaDPhaT3huU+Zw_j448rpCBEsu2YOQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Consistently prevent nbtree array advancement from treating a scankey as
required when operating in pstate.forcenonrequired mode. Otherwise, we
risk a NULL pointer dereference. This was possible in the path where
_bt_check_compare is called to recheck a tuple that advanced all of the
scan's arrays to matching values: its continuescan=false handling
expects _bt_advance_array_keys to have been called with a valid pstate,
but it'll always be NULL during sktrig_required=false calls (which is
how _bt_advance_array_keys must be called when pstate.forcenonrequired).
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKsn2W=gPBmj7p6MjQFvxB+zZDBkwTSg0o3f5Hh8rkRrsA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmodSE+gpTd1CRGU9ez8ytyyDS+Kns2r9NzgUp1s56kpw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To insert the merged GIN entries in _gin_parallel_merge, the leader
calls ginEntryInsert(). This may allocate memory, e.g. for a new leaf
tuple. This was allocated in the PortalContext, and kept until the end
of the index build. For most GIN indexes the amount of leaked memory is
negligible, but for custom opclasses with large keys it may cause OOMs.
Fixed by calling ginEntryInsert() in a temporary memory context, reset
after each insert. Other ginEntryInsert() callers do this too, except
that the context is reset after batches of inserts. More frequent resets
don't seem to hurt performance, and may even help it a bit.
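The reset-after-each-insert pattern can be sketched with a toy bump
allocator. This is only an illustration under stated assumptions: Arena,
arena_alloc, arena_reset, insert_entry and insert_all are hypothetical
stand-ins, not the PostgreSQL memory context API or ginEntryInsert().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARENA_SIZE 1024

/* Hypothetical stand-in for a short-lived memory context. */
typedef struct Arena
{
	unsigned char buf[ARENA_SIZE];
	size_t		used;
	size_t		high_water;		/* peak usage, kept for illustration */
} Arena;

static void *
arena_alloc(Arena *a, size_t n)
{
	void	   *p;

	assert(a->used + n <= ARENA_SIZE);
	p = a->buf + a->used;
	a->used += n;
	if (a->used > a->high_water)
		a->high_water = a->used;
	return p;
}

/* Release everything allocated since the previous reset. */
static void
arena_reset(Arena *a)
{
	a->used = 0;
}

/* Stand-in for an insert that needs scratch space sized by the key. */
static size_t
insert_entry(Arena *scratch, const char *key)
{
	size_t		len = strlen(key) + 1;
	char	   *copy = arena_alloc(scratch, len);

	memcpy(copy, key, len);
	return len;
}

/* Insert every key, resetting the scratch arena after each one. */
static size_t
insert_all(Arena *scratch, const char **keys, int nkeys)
{
	size_t		total = 0;

	for (int i = 0; i < nkeys; i++)
	{
		total += insert_entry(scratch, keys[i]);
		arena_reset(scratch);	/* nothing accumulates across inserts */
	}
	return total;
}
```

With the reset inside the loop, peak scratch usage is bounded by the largest
single insert rather than by the sum over the whole build.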
Report and fix by Vinod Sridharan.
Author: Vinod Sridharan <vsridh90@gmail.com>
Reviewed-by: Tomas Vondra <tomas@vondra.me>
Discussion: https://postgr.es/m/CAFMdLD4p0VBd8JG=Nbi=BKv6rzFAiGJ_sXSFrw-2tNmNZFO5Kg@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make an nbtree array preprocessing assertion account for scans that add
fewer skip arrays than initially expected due to preprocessing finding
an unsatisfiable array qual.
Oversight in commit 92fe23d9.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKtQMhHy5qcB3KqCcGiW-Rp8P7KzUFRa9ZMKUiv6zen7LQ@mail.gmail.com
|
|
|
|
|
| |
Author: Junwang Zhao <zhjwpku@gmail.com>
Discussion: https://postgr.es/m/CAEG8a3+MRwDKc4YSFKKPKq7Y+vMufVC5u94wM5KZPB2CbgCxnQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
| |
Index vacuuming and [auto]prewarm AIO concurrency should be governed by
maintenance_io_concurrency. As such, pass those read stream users the
READ_STREAM_MAINTENANCE flag which will calculate their read stream
distance with maintenance_io_concurrency instead of
effective_io_concurrency. This was an oversight in the original commits
making those operations use the read stream API.
Discussion: https://postgr.es/m/flat/CAAKRu_aopDxTo4b41Mt_7Zc-z0_ngocrY8SFCCY6Aph1HgwuNw%40mail.gmail.com
|
|
|
|
|
|
|
|
|
| |
The check for whether another primitive scan is required after all,
performed once the scan is on the next leaf page, was moved from
_bt_checkkeys to its _bt_readpage caller by commit 9a2e2a28. Update a
comment that incorrectly described the recheck mechanism as something
that takes place in _bt_checkkeys.
Also fix an older typo in related code comments.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
_bt_check_compare neglected to handle a case that can arise when the
scan's keys are temporarily treated as nonrequired, as an optimization:
whenever a NULL tuple value was encountered that had a skip array whose
current element wasn't already NULL, _bt_check_compare failed to advance
the array to the NULL element. This allowed _bt_check_compare to fail
to return matching tuples containing a NULL value (though only with an
array column that came before a skip array column with NULLs, and only
during _bt_readpage calls that set pstate.forcenonrequired=true on a
page where the higher-order column also had to advance).
To fix, teach _bt_check_compare to handle this case just like any other
case where a skip array key is unsatisfied and must be advanced directly
(due to the key being considered a nonrequired key).
Oversight in commit 8a510275, which optimized nbtree search scan key
comparisons with skip arrays.
Author: Peter Geoghegan <pg@bowt.ie>
Reported-By: Mark Dilger <mark.dilger@enterprisedb.com>
Discussion: https://postgr.es/m/CAHgHdKtLFWZcjr87hMH0hYDHgcifu4Tj7iHz-xh8qsJREt5cqA@mail.gmail.com
|
|
|
|
|
|
|
| |
These are all new to v18.
Author: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/CAApHDvrMcr8XD107H3NV=WHgyBcu=sx5+7=WArr-n_cWUqdFXQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Blocking checkpoint phase 2 requires MarkBufferDirty() and
BUFFER_LOCK_EXCLUSIVE; neither suffices by itself. transam/README documents
this, citing SyncOneBuffer(). Update the DELAY_CHKPT_START documentation to
say this. Expand the heap_inplace_update_and_unlock() comment that cites
XLogSaveBufferForHint() as precedent, since heap_inplace_update_and_unlock()
could have opted not to use DELAY_CHKPT_START.
Commit 8e7e672cdaa6bfec85d4d5dd9be84159df23bb41 added DELAY_CHKPT_START to
heap_inplace_update_and_unlock(). Since commit
bc6bad88572501aecaa2ac5d4bc900ac0fd457d5 reverted it in non-master branches,
no back-patch.
Discussion: https://postgr.es/m/20250406180054.26.nmisch@google.com
|
|
|
|
|
|
|
|
| |
The large majority of these have been introduced by recent commits done
in the v18 development cycle.
Author: Alexander Lakhin <exclusion@gmail.com>
Discussion: https://postgr.es/m/9a7763ab-5252-429d-a943-b28941e0e28b@gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 0f21db36d made an assumption that GIN triConsistentFns
would not modify their input entryRes[] arrays. But in fact,
the "shim" triConsistentFn that we use for opclasses that don't
supply their own did exactly that, potentially leading to wrong
answers from a GIN index search. Through bad luck, none of the
test cases that we have for such opclasses exposed the bug.
One response to this could be that the assumption of consistency check
functions not modifying entryRes[] arrays is a bad one, but it still
seems reasonable to me. Notably, shimTriConsistentFn is itself
assuming that with respect to the underlying boolean consistentFn,
so it's sure being self-centered in supposing that it gets to do so.
Fortunately, it's quite simple to fix shimTriConsistentFn to restore
the entry-time state of entryRes[], so let's do that instead.
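The idea of a ternary shim that borrows its input array as scratch space
and then restores it can be sketched as follows. This is a simplified
model, not the actual shimTriConsistentFn: the Tri type, MAX_MAYBE limit,
callback signature and the all_true/any_true examples are assumptions made
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	TRI_FALSE = 0,
	TRI_TRUE = 1,
	TRI_MAYBE = 2
} Tri;

#define MAX_MAYBE 4				/* hypothetical cap on combinations tried */

typedef bool (*bool_consistent_fn) (const Tri *check, int n);

/*
 * Derive a ternary answer from a boolean function by trying every
 * TRUE/FALSE assignment of the MAYBE inputs.  The array is overwritten
 * while probing, then put back exactly as it was on entry.
 */
static Tri
shim_tri_consistent(bool_consistent_fn fn, Tri *check, int n)
{
	int			maybe_idx[MAX_MAYBE];
	int			nmaybe = 0;
	bool		seen_true = false;
	bool		seen_false = false;

	assert(n >= 0);
	for (int i = 0; i < n; i++)
	{
		if (check[i] == TRI_MAYBE)
		{
			if (nmaybe >= MAX_MAYBE)
				return TRI_MAYBE;	/* give up; nothing modified yet */
			maybe_idx[nmaybe++] = i;
		}
	}

	for (unsigned mask = 0; mask < (1u << nmaybe); mask++)
	{
		for (int i = 0; i < nmaybe; i++)
			check[maybe_idx[i]] = ((mask >> i) & 1) ? TRI_TRUE : TRI_FALSE;

		if (fn(check, n))
			seen_true = true;
		else
			seen_false = true;
		if (seen_true && seen_false)
			break;				/* answers differ, so result is MAYBE */
	}

	/* Restore the entry-time state before returning. */
	for (int i = 0; i < nmaybe; i++)
		check[maybe_idx[i]] = TRI_MAYBE;

	if (seen_true && seen_false)
		return TRI_MAYBE;
	return seen_true ? TRI_TRUE : TRI_FALSE;
}

/* Example boolean functions: AND and OR over the entries. */
static bool
all_true(const Tri *check, int n)
{
	for (int i = 0; i < n; i++)
		if (check[i] != TRI_TRUE)
			return false;
	return true;
}

static bool
any_true(const Tri *check, int n)
{
	for (int i = 0; i < n; i++)
		if (check[i] == TRI_TRUE)
			return true;
	return false;
}
```

A caller that reuses the same array for a later check sees the MAYBE
entries unchanged, which is the assumption the calling code depends on.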
This issue doesn't affect any core GIN opclasses, since they all
supply their own triConsistentFns. It does affect contrib modules
btree_gin, hstore, and intarray.
Along the way, I (tgl) noticed that shimTriConsistentFn failed to
pick up on a "recheck" flag returned by its first call to the boolean
consistentFn. This may be only a latent problem, since it would be
unlikely for a consistentFn to set recheck for the all-false case
and not any other cases. (Indeed, none of our contrib modules do
that.) Nonetheless, it's formally wrong.
Reported-by: Vinod Sridharan <vsridh90@gmail.com>
Author: Vinod Sridharan <vsridh90@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/CAFMdLD7XzsXfi1+DpTqTgrD8XU0i2C99KuF=5VHLWjx4C1pkcg@mail.gmail.com
Backpatch-through: 13
|
|
|
|
|
|
|
|
|
|
| |
Make sure that function declarations use names that exactly match the
corresponding names from function definitions in a few places. These
inconsistencies were all introduced during Postgres 18 development.
This commit was written with help from clang-tidy, by mechanically
applying the same rules as similar clean-up commits (the earliest such
commit was commit 035ce1fe).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows not-null constraints to be added without scanning the
table, and to be validated afterwards without holding an access
exclusive lock on the table, once any violating rows have been deleted
or fixed.
Doing ALTER TABLE ... SET NOT NULL for a column that has an invalid
not-null constraint validates that constraint. ALTER TABLE .. VALIDATE
CONSTRAINT is also supported. There are various checks on whether an
invalid constraint is allowed in a child table when the parent table has
a valid constraint; this should match what we do for enforced/not
enforced constraints.
pg_attribute.attnotnull is now only an indicator for whether a not-null
constraint exists for the column; whether it's valid or invalid must be
queried in pg_constraint. Applications can continue to query
pg_attribute.attnotnull as before, but now it's possible that NULL rows
are present in the column even when that's set to true.
For backend internal purposes, we cache the nullability status in
CompactAttribute->attnullability that each tuple descriptor carries
(replacing CompactAttribute.attnotnull, which was a mirror of
Form_pg_attribute.attnotnull). During the initial tuple descriptor
creation, based on the pg_attribute scan, we set this to UNRESTRICTED if
pg_attribute.attnotnull is false, or to UNKNOWN if it's true; then we
update the latter to VALID or INVALID depending on the pg_constraint
scan. This flag is also copied when tupledescs are copied.
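The two-step derivation of the cached nullability status might look roughly
like this. The enum and function names below are hypothetical and chosen
for readability; they are not the identifiers used in the tree.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the four cached states described above. */
typedef enum
{
	NULLABILITY_UNRESTRICTED,	/* no not-null constraint on the column */
	NULLABILITY_UNKNOWN,		/* constraint exists, validity not looked up */
	NULLABILITY_VALID,
	NULLABILITY_INVALID
} Nullability;

/* Step 1: initial state, from the pg_attribute.attnotnull flag alone. */
static Nullability
from_attnotnull(bool attnotnull)
{
	return attnotnull ? NULLABILITY_UNKNOWN : NULLABILITY_UNRESTRICTED;
}

/* Step 2: resolve UNKNOWN once the constraint's validity is known. */
static Nullability
refine_from_constraint(Nullability cur, bool convalidated)
{
	if (cur != NULLABILITY_UNKNOWN)
		return cur;
	return convalidated ? NULLABILITY_VALID : NULLABILITY_INVALID;
}

/* Only a VALID constraint guarantees that no NULLs are stored. */
static bool
column_known_not_null(Nullability n)
{
	assert(n != NULLABILITY_UNKNOWN);	/* must be resolved before use */
	return n == NULLABILITY_VALID;
}
```

The point of the intermediate UNKNOWN state is that attnotnull alone can no
longer tell a consumer whether NULLs may be present.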
Comparing tuple descs for equality must also compare the
CompactAttribute.attnullability flag and return false in case of a
mismatch.
pg_dump deals with these constraints by storing the OIDs of invalid
not-null constraints in a separate array, and running a query to obtain
their properties. The regular table creation SQL omits them entirely.
They are then dealt with in the same way as "separate" CHECK
constraints, and dumped after the data has been loaded. Because no
additional pg_dump infrastructure was required, we don't bump its
version number.
I decided not to bump catversion either, because the old catalog state
works perfectly in the new world. (Trying to run with new catalog state
and the old server version would likely run into issues, however.)
System catalogs do not support invalid not-null constraints (because
commit 14e87ffa5c54 didn't allow them to have pg_constraint rows
anyway).
Author: Rushabh Lathia <rushabh.lathia@gmail.com>
Author: Jian He <jian.universality@gmail.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Tested-by: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Discussion: https://postgr.es/m/CAGPqQf0KitkNack4F5CFkFi-9Dqvp29Ro=EpcWt=4_hs-Rt+bQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There were several places in ordering-related planning where a
requirement for btree was hardcoded but an amcanorder index could
suffice. This fixes that. We just need to do the necessary mapping
between strategy numbers and compare types and adjust some related
APIs so that this works independent of btree strategy numbers. For
instance, non-btree amcanorder indexes can now be used to support
sorting and merge joins. Also, predtest.c works independent of btree
strategy numbers now.
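To illustrate the kind of mapping involved, here is a small sketch in which
the planner side asks for a comparison kind and each access method
translates it to its own strategy number. The type and function names are
hypothetical, and the second access method with strategies 10 to 12 is made
up; only the btree numbering (1 to 5 for <, <=, =, >=, >) reflects
PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

/* AM-independent comparison kinds (illustrative names). */
typedef enum
{
	CMP_INVALID = 0,
	CMP_LT,
	CMP_LE,
	CMP_EQ,
	CMP_GE,
	CMP_GT
} CompareKind;

/* An access method supplies both directions of the mapping. */
typedef struct AmOrderingOps
{
	CompareKind (*strategy_to_cmp) (int strategy);
	int			(*cmp_to_strategy) (CompareKind cmp);
} AmOrderingOps;

/* btree numbering: 1..5 are <, <=, =, >=, > */
static CompareKind
bt_strategy_to_cmp(int strategy)
{
	return (strategy >= 1 && strategy <= 5) ? (CompareKind) strategy : CMP_INVALID;
}

static int
bt_cmp_to_strategy(CompareKind cmp)
{
	return (int) cmp;
}

/* Made-up AM: 10 is "=", 11 is "<", 12 is ">"; nothing else supported. */
static CompareKind
my_strategy_to_cmp(int strategy)
{
	switch (strategy)
	{
		case 10:
			return CMP_EQ;
		case 11:
			return CMP_LT;
		case 12:
			return CMP_GT;
		default:
			return CMP_INVALID;
	}
}

static int
my_cmp_to_strategy(CompareKind cmp)
{
	switch (cmp)
	{
		case CMP_EQ:
			return 10;
		case CMP_LT:
			return 11;
		case CMP_GT:
			return 12;
		default:
			return 0;			/* 0 means "no such strategy" here */
	}
}

/* Planner-side helper: which strategy orders rows for this sort? */
static int
sort_strategy(const AmOrderingOps *am, int descending)
{
	assert(am != NULL);
	return am->cmp_to_strategy(descending ? CMP_GT : CMP_LT);
}
```

Code written against CompareKind never needs to know that "less than"
happens to be strategy 1 in btree.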
To avoid performance regressions, some details on btree and other
built-in index types are still hardcoded as shortcuts, but other index
types now have access to the same features by providing the required
flags and callbacks.
Author: Mark Dilger <mark.dilger@enterprisedb.com>
Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Transform low_compare and high_compare nbtree skip array inequalities
(with opclasses that offer skip support) in such a way as to allow
_bt_first to consistently apply later keys when it descends the tree.
This can lower the number of index searches for multi-column scans that
use a ">" key on one of the index's prefix columns (or use a "<" key,
when scanning backwards) when it precedes some later lower-order key.
For example, an index qual "WHERE a > 5 AND b = 2" will now be converted
to "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place
after low_compare and high_compare have been finalized. That way, the
initial call to _bt_first can use "WHERE a >= 6 AND b = 2" to find an
initial position, rather than just using "WHERE a > 5" -- "b = 2" can be
applied during every _bt_first call. There's a decent chance that this
will allow such a scan to avoid the extra search that might otherwise be
needed to determine the lowest "a" value still satisfying "WHERE a > 5".
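For an integer column, the rewrite from an exclusive to an inclusive bound
can be sketched as below. This is a simplified illustration with
hypothetical names (IneqOp, IntKey, make_inclusive); leaving the key
untouched when the increment would overflow is an assumption of this
sketch, not a description of how nbtree handles that case.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	OP_GT,
	OP_GE,
	OP_LT,
	OP_LE
} IneqOp;

typedef struct IntKey
{
	IneqOp		op;
	int			value;
} IntKey;

/*
 * Rewrite "> v" as ">= v + 1" and "< v" as "<= v - 1".  Returns true if
 * the key was rewritten.  Keys that are already inclusive, or whose value
 * cannot be stepped without overflow, are left unchanged.
 */
static bool
make_inclusive(IntKey *key)
{
	assert(key != NULL);

	if (key->op == OP_GT && key->value < INT_MAX)
	{
		key->op = OP_GE;
		key->value++;
		return true;
	}
	if (key->op == OP_LT && key->value > INT_MIN)
	{
		key->op = OP_LE;
		key->value--;
		return true;
	}
	return false;
}
```

Once the bound is inclusive, "a >= 6" names a concrete starting value, so
a later key such as "b = 2" can take part in the initial descent.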
The transformation process can only lower the total number of index
pages read when the use of a more restrictive set of initial positioning
keys in _bt_first actually allows the scan to land on some later leaf
page directly, relative to the unoptimized case (or on an earlier leaf
page directly, when scanning backwards). But the savings can really add
up in cases where an affected skip array comes after some other array.
For example, a scan indexqual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2"
can save as many as 3 _bt_first calls by applying the new transformation
to its "y" array (up to 1 extra search can be avoided per "x" element).
Follow-up to commit 92fe23d9, which added nbtree skip scan.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Don't allow nbtree scans with skip arrays to end any primitive scan on
its first leaf page without giving some consideration to how many times
the scan's arrays advanced while changing at least one skip array
(though continue not caring about the number of array advancements that
only affected SAOP arrays, even during skip scans with SAOP arrays).
Now when a scan performs more than 3 such array advancements in the
course of reading a single leaf page, it is taken as a signal that the
next page is unlikely to be skippable. We'll therefore continue the
ongoing primitive index scan, at least until we can perform a recheck
against the next page's finaltup.
Testing has shown that this new heuristic occasionally makes all the
difference with skip scans that were expected to rely on the "passed
first page" heuristic added by commit 9a2e2a28. Without it, there is a
remaining risk that certain kinds of skip scans will never quite manage
to clear the initial hurdle of performing a primitive scan that lasts
beyond its first leaf page (or that such a skip scan will only clear
that initial hurdle when it has already wasted noticeably-many cycles
due to inefficient primitive scan scheduling).
Follow-up to commits 92fe23d9 and 9a2e2a28.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Postgres 17 commit e0b1ee17 added two complementary optimizations to
nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage
was made to avoid needlessly evaluating keys that are guaranteed to be
satisfied by applying page-level context. "prechecked" did this for
keys required in the current scan direction, while "firstmatch" did it
for keys required in the opposite-to-scan direction only.
The "prechecked" design had a number of notable issues. It didn't
account for the fact that an = array scan key's sk_argument field might
need to advance at the point of the page precheck (it didn't check the
precheck tuple against the key's array, only the key's sk_argument,
which needlessly made it ineffective in cases involving stepping to a
page having advanced the scan's arrays using a truncated high key).
"prechecked" was also completely ineffective when only one scan key
wasn't guaranteed to be satisfied by every tuple (it didn't recognize
that it was still safe to avoid evaluating other, earlier keys).
The "firstmatch" optimization had similar limitations. It could only be
applied after _bt_readpage found its first matching tuple, regardless of
why any earlier tuples failed to satisfy the scan's index quals. This
allowed unsatisfied non-required scan keys to impede the optimization.
Replace both optimizations with a new optimization, without any of these
limitations: the "startikey" optimization. Affected _bt_readpage calls
generate a page-level key offset ("startikey"), that their _bt_checkkeys
calls can then start at. This is an offset to the first key that isn't
known to be satisfied by every tuple on the page.
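A much-simplified model of computing such a page-level offset is sketched
below. It assumes fixed-width integer tuples sorted lexicographically and
keys listed in column order, and it looks only at the page's first and last
tuples. The names (Key, compute_startikey, tuple_matches) are hypothetical,
and none of the array or NULL handling of the real code is modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define NCOLS 3

typedef enum
{
	KEY_EQ,
	KEY_GE,
	KEY_LE
} KeyOp;

typedef struct Key
{
	int			col;			/* 0-based column the key applies to */
	KeyOp		op;
	int			value;
} Key;

static bool
key_ok(const Key *k, const int *tup)
{
	int			v = tup[k->col];

	switch (k->op)
	{
		case KEY_EQ:
			return v == k->value;
		case KEY_GE:
			return v >= k->value;
		case KEY_LE:
			return v <= k->value;
	}
	return false;
}

/*
 * Return the offset of the first key that is not known to be satisfied by
 * every tuple on the page.  A column's values are only bounded by the
 * first and last tuple when every earlier column is constant on the page.
 */
static int
compute_startikey(const Key *keys, int nkeys,
				  const int *first, const int *last)
{
	for (int i = 0; i < nkeys; i++)
	{
		int			c = keys[i].col;

		assert(c >= 0 && c < NCOLS);
		assert(i == 0 || keys[i - 1].col <= c);

		for (int j = 0; j < c; j++)
			if (first[j] != last[j])
				return i;		/* column c is not ordered on this page */

		if (!key_ok(&keys[i], first) || !key_ok(&keys[i], last))
			return i;
	}
	return nkeys;
}

/* Per-tuple check that starts at startikey instead of at key 0. */
static bool
tuple_matches(const Key *keys, int nkeys, int startikey, const int *tup)
{
	for (int i = startikey; i < nkeys; i++)
		if (!key_ok(&keys[i], tup))
			return false;
	return true;
}
```

In this model the offset is computed once per page, and each tuple then
pays only for the keys at or after it.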
Although this is independently useful work, its main goal is to avoid
performance regressions with index scans that use skip arrays, but still
never manage to skip over irrelevant leaf pages. We must avoid wasting
CPU cycles on overly granular skip array maintenance in these cases.
The new "startikey" optimization helps with this by selectively
disabling array maintenance for the duration of a _bt_readpage call.
This has no lasting consequences for the scan's array keys (they'll
still reliably track the scan's progress through the index's key space
whenever the scan is "between pages").
Skip scan adds skip arrays during preprocessing using simple, static
rules, and decides how best to navigate/apply the scan's skip arrays
dynamically, at runtime. The "startikey" optimization enables this
approach. As a result of all this, the planner doesn't need to generate
distinct, competing index paths (one path for skip scan, another for an
equivalent traditional full index scan). The overall effect is to make
scan runtime close to optimal, even when the planner works off an
incorrect cardinality estimate. Scans will also perform well given a
skipped column with data skew: individual groups of pages with many
distinct values (in respect of a skipped column) can be read about as
efficiently as before -- without the scan being forced to give up on
skipping over other groups of pages that are provably irrelevant.
Many scans that cannot possibly skip will still benefit from the use of
skip arrays, since they'll allow the "startikey" optimization to be as
effective as possible (by allowing preprocessing to mark all the scan's
keys as required). A scan that uses a skip array on "a" for a qual
"WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now,
even when every tuple read by the scan has its own distinct "a" value.
However, there are still some remaining regressions, affecting certain
trickier cases.
Scans whose index quals have several range skip arrays, each on some
high cardinality column, can still be slower than they were before the
introduction of skip scan -- even with the new "startikey" optimization.
There are also known regressions affecting very selective index scans
that use a skip array. The underlying issue with such selective scans
is that they never get as far as reading a second leaf page, and so will
never get a chance to consider applying the "startikey" optimization.
In principle, all regressions could be avoided by teaching preprocessing
to not add skip arrays whenever they aren't expected to help, but it
seems best to err on the side of robust performance.
Follow-up to commit 92fe23d9, which added nbtree skip scan.
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Teach nbtree multi-column index scans to opportunistically skip over
irrelevant sections of the index given a query with no "=" conditions on
one or more prefix index columns. When nbtree is passed input scan keys
derived from a predicate "WHERE b = 5", new nbtree preprocessing steps
output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
That is, preprocessing generates a "skip array" (and an output scan key)
for the omitted prefix column "a", which makes it safe to mark the scan
key on "b" as required to continue the scan. The scan is therefore able
to repeatedly reposition itself by applying both the "a" and "b" keys.
A skip array has "elements" that are generated procedurally and on
demand, but otherwise works just like a regular ScalarArrayOp array.
Preprocessing can freely add a skip array before or after any input
ScalarArrayOp arrays. Index scans with a skip array decide when and
where to reposition the scan using the same approach as any other scan
with array keys. This design builds on the design for array advancement
and primitive scan scheduling added to Postgres 17 by commit 5bf748b8.
Testing has shown that skip scans of an index with a low cardinality
skipped prefix column can be multiple orders of magnitude faster than an
equivalent full index scan (or sequential scan). In general, the
cardinality of the scan's skipped column(s) limits the number of leaf
pages that can be skipped over.
The core B-Tree operator classes on most discrete types generate their
array elements with the help of their own custom skip support routine.
This infrastructure gives nbtree a way to generate the next required
array element by incrementing (or decrementing) the current array value.
It can reduce the number of index descents in cases where the next
possible indexable value frequently turns out to be the next value
stored in the index. Opclasses that lack a skip support routine fall
back on having nbtree "increment" (or "decrement") a skip array's
current element by setting the NEXT (or PRIOR) scan key flag, without
directly changing the scan key's sk_argument. These sentinel values
behave just like any other value from an array -- though they can never
locate equal index tuples (they can only locate the next group of index
tuples containing the next set of non-sentinel values that the scan's
arrays need to advance to).
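The repositioning loop can be modeled on a sorted in-memory array standing
in for an index on (a, b), with a qual of "b = target" and a skip array on
"a". This is a toy sketch under stated assumptions: Row, lower_bound and
skip_scan are hypothetical, a binary search stands in for an index descent,
and incrementing "a" by one plays the part of a skip support routine for an
integer column.

```c
#include <assert.h>
#include <limits.h>

typedef struct Row
{
	int			a;
	int			b;
} Row;

/* Index of the first row that sorts at or after (a, b). */
static int
lower_bound(const Row *rows, int n, int a, int b)
{
	int			lo = 0;
	int			hi = n;

	while (lo < hi)
	{
		int			mid = lo + (hi - lo) / 2;

		if (rows[mid].a < a || (rows[mid].a == a && rows[mid].b < b))
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Collect the positions of all rows with b == target_b, without reading
 * every row.  Returns the number of matches; *nsearches counts the binary
 * searches, which stand in for index descents.
 */
static int
skip_scan(const Row *rows, int n, int target_b,
		  int *out, int out_cap, int *nsearches)
{
	int			nout = 0;
	int			a = INT_MIN;	/* current skip array element */

	*nsearches = 0;
	for (;;)
	{
		int			i = lower_bound(rows, n, a, target_b);

		(*nsearches)++;
		if (i == n)
			break;				/* ran off the end of the "index" */

		if (rows[i].a > a)
		{
			/* No rows for this "a"; advance to the value actually stored. */
			a = rows[i].a;
			if (rows[i].b < target_b)
				continue;		/* reposition within the new "a" group */
		}

		/* Here rows[i].a == a and rows[i].b >= target_b. */
		while (i < n && rows[i].a == a && rows[i].b == target_b)
		{
			assert(nout < out_cap);
			out[nout++] = i++;
		}

		if (a == INT_MAX)
			break;
		a++;					/* "increment" to the next possible value */
	}
	return nout;
}
```

The number of searches grows with the number of distinct "a" values rather
than with the number of rows, which is why a low cardinality prefix column
is the favorable case.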
A skip array's range is constrained by "contradictory" inequality keys.
For example, a skip array on "x" will only generate the values 1 and 2
given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66". Such a skip
array qual usually has near-identical performance characteristics to a
comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66". However,
improved performance isn't guaranteed. Much depends on physical index
characteristics.
B-Tree preprocessing is optimistic about skipping working out: it
applies static, generic rules when determining where to generate skip
arrays, which assumes that the runtime overhead of maintaining skip
arrays will pay for itself -- or lead to only a modest performance loss.
As things stand, these assumptions are much too optimistic: skip array
maintenance will lead to unacceptable regressions with unsympathetic
queries (queries whose scan can't skip over many irrelevant leaf pages).
An upcoming commit will address the problems in this area by enhancing
_bt_readpage's approach to saving cycles on scan key evaluation, making
it work in a way that directly considers the needs of = array keys
(particularly = skip array keys).
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
Reviewed-By: Tomas Vondra <tomas@vondra.me>
Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru>
Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 28d3c2ddcf introduced an assertion that if the memorized
downlink location in the insertion stack isn't valid, the parent's
LSN should've changed too. Turns out that was too strict. In
gistFindCorrectParent(), if we walk right, we update the parent's
block number and clear its memorized 'downlinkoffnum'. That triggered
the assertion on the next call to gistFindCorrectParent(), if the parent
needed to be split too. Relax the assertion, so that it's OK if
downlinkOffnum is InvalidOffsetNumber.
Backpatch to v13-, all supported versions. The assertion was added in
commit 28d3c2ddcf in v12.
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reviewed-by: Tender Wang <tndrwang@gmail.com>
Discussion: https://www.postgresql.org/message-id/18396-03cac9beb2f7aac3@postgresql.org
|
|
|
|
|
|
|
|
|
|
|
| |
Previously bitmap heap scan was not AIO batchmode safe because of the
visibility map reads potentially done for the "skip fetch" optimization
(which skipped fetching tuples from the heap if the pages were all
visible and none of the columns were used in the query).
The skip fetch optimization implementation was found to have bugs and
was removed in 459e7bf8e2f8, so we can safely enable batchmode for
bitmap heap scans.
|