aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
* Fix typo in comment of gistdoinsert().Masahiko Sawada2024-11-04
| | | | | | Author: Tender Wang Reviewed-by: Masahiko Sawada Discussion: https://postgr.es/m/CAHewXN%3D3sH2sNw4nC3QGCEVw1Lftmw9m5y1Xje0bXK6ApDrsPQ%40mail.gmail.com
* Fix obsolete _bt_first comments.Peter Geoghegan2024-11-04
| | | | | | | _bt_first doesn't necessarily hold onto a buffer pin on success exit. Fix header comments that claimed that we'll always hold onto a pin. Oversight in commit 2ed5b87f96.
* nbtree: Remove useless 'strat' local variable.Peter Geoghegan2024-11-04
| | | | | | | | | | | | | | | | | | | | | Remove a local variable that was used to avoid overwriting strat_total with the = operator strategy when a >= operator strategy key was already included in the initial positioning/insertion scan keys by _bt_first (for backwards scans it would have to be a <= key that was included). _bt_first's strat_total local variable now simply tracks the operator strategy of the final scan key that was included in the scan's insertion scan key (barring the case where the !used_all_subkeys row compare path adjusts strat_total in its own way). _bt_first already treated >= keys (or <= keys) as = keys for initial positioning purposes. There is no good reason to remember that that was what happened; no later _bt_first step cares about the distinction. Note, in particular, that the insertion scan key's 'nextkey' and 'backward' fields will be initialized the same way regardless. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com
* Clarify nbtree parallel scan _bt_endpoint contract.Peter Geoghegan2024-11-04
| | | | | | | | | | | | | | | _bt_endpoint is a helper function for _bt_first that's called whenever no useful insertion scan key can be used, and we need to lock and read either the leftmost or rightmost leaf page in the index. Simplify and document its preconditions, relieving its _bt_first caller from having to end the parallel scan when it returns false. Also stop unnecessarily invalidating the current scan position in nearby code in both _bt_first and _bt_endpoint. This seems to have been copy-pasted from _bt_readnextpage, where invalidating the scan's current position really is necessary. Follow-up to the refactoring work in commit 1bd4bc85.
* Fix inplace update buffer self-deadlock.Noah Misch2024-11-02
| | | | | | | | | | | | | | | A CacheInvalidateHeapTuple* callee might call CatalogCacheInitializeCache(), which needs a relcache entry. Acquiring a valid relcache entry might scan pg_class. Hence, to prevent undetected LWLock self-deadlock, CacheInvalidateHeapTuple* callers must not hold BUFFER_LOCK_EXCLUSIVE on buffers of pg_class. Move the CacheInvalidateHeapTupleInplace() before the BUFFER_LOCK_EXCLUSIVE. No back-patch, since I've reverted commit 243e9b40f1b2dd09d6e5bf91ebf6e822a2cd3704 from non-master branches. Reported by Alexander Lakhin. Reviewed by Alexander Lakhin. Discussion: https://postgr.es/m/10ec0bc3-5933-1189-6bb8-5dec4114558e@gmail.com
* Clarify nbtree array preprocessing comment.Peter Geoghegan2024-11-01
| | | | Oversight in commit 5bf748b8.
* Rename two functions that wake up other processesHeikki Linnakangas2024-11-01
| | | | | | | | | | | Instead of talking about setting latches, which is a pretty low-level mechanism, emphasize that they wake up other processes. This is in preparation for replacing Latches with a new abstraction. That's still work in progress, but this seems a little tidier anyway, so let's get this refactoring out of the way already. Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c%40iki.fi
* Use ProcNumbers instead of direct Latch pointers to address other procsHeikki Linnakangas2024-11-01
| | | | | | | | This is in preparation for replacing Latches with a new abstraction. That's still work in progress, but this seems a little tidier anyway, so let's get this refactoring out of the way already. Discussion: https://www.postgresql.org/message-id/391abe21-413e-4d91-a650-b663af49500c%40iki.fi
* nbtree: assert no scheduled primscan between pages.Peter Geoghegan2024-10-30
| | | | | | | | | | Follow-up to bugfix commit 763d65ae. Technically this new assertion is redundant with the assertion recently added to _bt_readpage by that same commit, but it seems like a good idea to have both. The new assertion makes it clear that we expect to call _bt_readnextpage when there's another primitive index scan scheduled, though only when needed as the final step of ending the current primitive scan.
* Clarify nbtree array exhaustion comments.Peter Geoghegan2024-10-30
| | | | | | | | | | | | | | | | Strictly speaking, we only need to make sure to leave the scan's array keys in their final positions (final for the current scan direction) to handle SAOP array exhaustion because btgettuple might only return a subset of the items for the final page (final for the current scan direction), before the scan changes direction. While it's typical for so->currPos to be invalidated shortly after the scan's arrays are first exhausted, and while so->currPos invalidation does obviate the need to leave the scan's arrays in any particular state, we can't rely on any of that actually happening when handling array exhaustion. Adjust comments to make all of that a lot clearer. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution.
* Fix bug in nbtree array primitive scan scheduling.Peter Geoghegan2024-10-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A bug in nbtree's handling of primitive index scan scheduling could lead to wrong answers when a scrollable cursor was used with an index scan that had a SAOP index qual. Wrong answers were only possible when the scan direction changed after a primitive scan was scheduled, but before _bt_next was asked to fetch the next tuple in line (i.e. for things to break, _bt_next had to be denied the opportunity to step off the page in the same direction as the one used when the primscan was scheduled). Furthermore, the issue only occurred when the page in question happened to be the first page to be visited by the entire top-level scan; the issue hinged upon the cursor backing up to the absolute beginning of the key space that it returns tuples from (fetching in the opposite scan direction across a "primitive scan boundary" always worked correctly). To fix, make _bt_next unset the "needs primitive index scan" flag when it detects that the current scan direction is not the one that was used by _bt_readpage back when the primitive scan in question was scheduled. This fixes the cases that are known to be faulty, and also seems like a good idea on general robustness grounds. Affected scrollable cursor cases now avoid a spurious primitive index scan when they fetch backwards to the absolute start of the key space to be visited by their cursor. Fetching backwards now only returns those tuples at the start of the scan, as expected. It'll also be okay to once again fetch forwards from the start at that point, since the scan will be left in a state that's exactly consistent with the state it was in before any tuples were ever fetched, as expected. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-Wznv49bFsE2jkt4GuZ0tU2C91dEST=50egzjY2FeOcHL4Q@mail.gmail.com Backpatch: 17-, where commit 5bf748b8 first appears.
* Unpin buffer before inplace update waits for an XID to end.Noah Misch2024-10-29
| | | | | | | | | | | | Commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257 changed inplace updates to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. By keeping the pin during that wait, a sequence of autovacuum workers and an uncommitted GRANT starved one foreground LockBufferForCleanup() for six minutes, on buildfarm member sarus. Prevent, at the cost of a bit of complexity. Back-patch to v12, like the earlier commit. That commit and heap_inplace_lock() have not yet appeared in any release. Discussion: https://postgr.es/m/20241026184936.ae.nmisch@google.com
* Fix WAL_DEBUG buildPeter Eisentraut2024-10-28
| | | | | | broken by commit e18512c000e Reported-by: Peter Geoghegan <pg@bowt.ie>
* nbtree: Minor sibling link traversal tweaks.Peter Geoghegan2024-10-28
| | | | | | | Tweak some code comments for clarity, and relocate some local variable declarations to the scope where they're actually used. Follow-up to recent commit 1bd4bc85.
* Fix obsolete nbtree split buffer comment.Peter Geoghegan2024-10-27
| | | | Oversight in commit d088ba5a.
* Remove unused #include's from backend .c filesPeter Eisentraut2024-10-27
| | | | | | | | as determined by IWYU These are mostly issues that are new since commit dbbca2cf299. Discussion: https://www.postgresql.org/message-id/flat/0df1d5b1-8ca8-4f84-93be-121081bde049%40eisentraut.org
* Make table_scan_bitmap_next_block() async-friendlyMelanie Plageman2024-10-25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move all responsibility for indicating a block is exhuasted into table_scan_bitmap_next_tuple() and advance the main iterator in heap-specific code. This flow control makes more sense and is a step toward using the read stream API for bitmap heap scans. Previously, table_scan_bitmap_next_block() returned false to indicate table_scan_bitmap_next_tuple() should not be called for the tuples on the page. This happened both when 1) there were no visible tuples on the page and 2) when the block returned by the iterator was past the end of the table. BitmapHeapNext() (generic bitmap table scan code) handled the case when the bitmap was exhausted. It makes more sense for table_scan_bitmap_next_tuple() to return false when there are no visible tuples on the page and table_scan_bitmap_next_block() to return false when the bitmap is exhausted or there are no more blocks in the table. As part of this new design, TBMIterateResults are no longer used as a flow control mechanism in BitmapHeapNext(), so we removed table_scan_bitmap_next_tuple's TBMIterateResult parameter. Note that the prefetch iterator is still saved in the BitmapHeapScanState node and advanced in generic bitmap table scan code. This is because 1) it was not necessary to change the prefetch iterator location to change the flow control in BitmapHeapNext() 2) modifying prefetch iterator management requires several more steps better split over multiple commits and 3) the prefetch iterator will be removed once the read stream API is used. Author: Melanie Plageman Reviewed-by: Tomas Vondra, Andres Freund, Heikki Linnakangas, Mark Dilger Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
* Move EXPLAIN counter increment to heapam_scan_bitmap_next_blockMelanie Plageman2024-10-25
| | | | | | | | | | | | | | | | Increment the lossy and exact page counters for EXPLAIN of bitmap heap scans in heapam_scan_bitmap_next_block(). Note that other table AMs will need to do this as well Pushing the counters into heapam_scan_bitmap_next_block() is required to be able to use the read stream API for bitmap heap scans. The bitmap iterator must be advanced from inside the read stream callback, so TBMIterateResults cannot be used as a flow control mechanism in BitmapHeapNext(). Author: Melanie Plageman Reviewed-by: Tomas Vondra, Heikki Linnakangas Discussion: https://postgr.es/m/063e4eb4-32d9-439e-a0b1-75565a9835a8%40iki.fi
* WAL-log inplace update before revealing it to other sessions.Noah Misch2024-10-25
| | | | | | | | | | | | A buffer lock won't stop a reader having already checked tuple visibility. If a vac_update_datfrozenid() and then a crash happened during inplace update of a relfrozenxid value, datfrozenxid could overtake relfrozenxid. That could lead to "could not access status of transaction" errors. Back-patch to v12 (all supported versions). In v14 and earlier, this also back-patches the assertion removal from commit 7fcf2faf9c7dd473208fd6d5565f88d7f733782b. Discussion: https://postgr.es/m/20240620012908.92.nmisch@google.com
* For inplace update, send nontransactional invalidations.Noah Misch2024-10-25
| | | | | | | | | | | | | | | | The inplace update survives ROLLBACK. The inval didn't, so another backend's DDL could then update the row without incorporating the inplace update. In the test this fixes, a mix of CREATE INDEX and ALTER TABLE resulted in a table with an index, yet relhasindex=f. That is a source of index corruption. Back-patch to v12 (all supported versions). The back branch versions don't change WAL, because those branches just added end-of-recovery SIResetAll(). All branches change the ABI of extern function PrepareToInvalidateCacheTuple(). No PGXN extension calls that, and there's no apparent use case in extensions. Reviewed by Nitin Motiani and (in earlier versions) Andres Freund. Discussion: https://postgr.es/m/20240523000548.58.nmisch@google.com
* Fix parallel worker tracking of new catalog relfilenumbers.Noah Misch2024-10-24
| | | | | | | | | | | | | | | | | | | Reunite RestorePendingSyncs() with RestoreRelationMap(). If RelationInitPhysicalAddr() ran after RestoreRelationMap() but before RestorePendingSyncs(), the relcache entry could cause RelationNeedsWAL() to return true erroneously. Trouble required commands of the current transaction to include REINDEX or CLUSTER of a system catalog. The parallel leader correctly derived RelationNeedsWAL()==false from the new relfilenumber, but the worker saw RelationNeedsWAL()==true. Worker MarkBufferDirtyHint() then wrote unwanted WAL. Recovery of that unwanted WAL could lose tuples like the system could before commit c6b92041d38512a4176ed76ad06f713d2e6c01a8 introduced this tracking. RestorePendingSyncs() and RestoreRelationMap() were adjacent till commit 126ec0bc76d044d3a9eb86538b61242bf7da6db4, so no back-patch for now. Reviewed by Tom Lane. Discussion: https://postgr.es/m/20241019232815.c6.nmisch@google.com
* Stop reading uninitialized memory in heap_inplace_lock().Noah Misch2024-10-24
| | | | | | | | | | Stop computing a never-used value. This removes the read; the read had no functional implications. Back-patch to v12, like commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257. Reported by Alexander Lakhin. Discussion: https://postgr.es/m/6c92f59b-f5bc-e58c-9bdd-d1f21c17c786@gmail.com
* Add 'no_error' argument to pg_wal_replay_wait()Alexander Korotkov2024-10-24
| | | | | | | | | | | This argument allow skipping throwing an error. Instead, the result status can be obtained using pg_wal_replay_wait_status() function. Catversion is bumped. Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZtUF17gF0pNpwZDI%40paquier.xyz Reviewed-by: Pavel Borisov
* Refactor WaitForLSNReplay() to return the result of waitingAlexander Korotkov2024-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, WaitForLSNReplay() immediately throws an error if waiting for LSN replay is not successful. This commit teaches WaitForLSNReplay() to return the result of waiting, while making pg_wal_replay_wait() responsible for throwing an appropriate error. This is preparation to adding 'no_error' argument to pg_wal_replay_wait() and new function pg_wal_replay_wait_status(), which returns the last wait result status. Additionally, we stop distinguishing situations when we find our instance to be not in a recovery state before entering the waiting loop and inside the waiting loop. Standby promotion may happen at any moment, even between issuing a procedure call statement and pg_wal_replay_wait() doing a first check of recovery status. Thus, there is no pointing distinguishing these situations. Also, since we may exit the waiting loop and see our instance not in recovery without throwing an error, we need to deleteLSNWaiter() in that case. We do this unconditionally for the sake of simplicity, even if standby was already promoted after reaching the target LSN, the startup process surely already deleted us. Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZtUF17gF0pNpwZDI%40paquier.xyz Reviewed-by: Michael Paquier, Pavel Borisov
* Make WaitForLSNReplay() issue FATAL on postmaster deathAlexander Korotkov2024-10-24
| | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZvY2C8N4ZqgCFaLu%40paquier.xyz Reviewed-by: Pavel Borisov
* Move LSN waiting declarations and definitions to better placeAlexander Korotkov2024-10-24
| | | | | | | | | | | | | | | | | | | 3c5db1d6b implemented the pg_wal_replay_wait() stored procedure. Due to the patch development history, the implementation resided in src/backend/commands/waitlsn.c (src/include/commands/waitlsn.h for headers). 014f9f34d moved pg_wal_replay_wait() itself to src/backend/access/transam/xlogfuncs.c near to the WAL-manipulation functions. But most of the implementation stayed in place. The code in src/backend/commands/waitlsn.c has nothing to do with commands, but is related to WAL. So, this commit moves this code into src/backend/access/transam/xlogwait.c (src/include/access/xlogwait.h for headers). Reported-by: Peter Eisentraut Discussion: https://postgr.es/m/18c0fa64-0475-415e-a1bd-665d922c5201%40eisentraut.org Reviewed-by: Pavel Borisov
* Avoid looping over all type cache entries in TypeCacheRelCallback()Alexander Korotkov2024-10-24
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, when a single relcache entry gets invalidated, TypeCacheRelCallback() has to loop over all type cache entries to find appropriate typentry to invalidate. Unfortunately, using the syscache here is impossible, because this callback could be called outside a transaction and this makes impossible catalog lookups. This is why present commit introduces RelIdToTypeIdCacheHash to map relation OID to its composite type OID. We are keeping RelIdToTypeIdCacheHash entry while corresponding type cache entry have something to clean. Therefore, RelIdToTypeIdCacheHash shouldn't get bloat in the case of temporary tables flood. There are many places in lookup_type_cache() where syscache invalidation, user interruption, or even error could occur. In order to handle this, we keep an array of in-progress type cache entries. In the case of lookup_type_cache() interruption this array is processed to keep RelIdToTypeIdCacheHash in a consistent state. Discussion: https://postgr.es/m/5812a6e5-68ae-4d84-9d85-b443176966a1%40sigaev.ru Author: Teodor Sigaev Reviewed-by: Aleksander Alekseev, Tom Lane, Michael Paquier, Roman Zharkov Reviewed-by: Andrei Lepikhov, Pavel Borisov, Jian He, Alexander Lakhin Reviewed-by: Artur Zakirov
* Optimize nbtree backwards scans.Peter Geoghegan2024-10-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make nbtree backwards scans optimistically access the next page to be read to the left by following a prevPage block number that's now stashed in currPos when the leaf page is first read. This approach matches the one taken during forward scans, which follow a symmetric nextPage block number from currPos. We stash both a prevPage and a nextPage, since the scan direction might change (when fetching from a scrollable cursor). Backwards scans will no longer need to lock the same page twice, except in rare cases where the scan detects a concurrent page split (or page deletion). Testing has shown this optimization to be particularly effective during parallel index-only backwards scans: ~12% reductions in query execution time are quite possible. We're much better off being optimistic; concurrent left sibling page splits are rare in general. It's possible that we'll need to lock more pages than the pessimistic approach would have, but only when there are _multiple_ concurrent splits of the left sibling page we now start at. If there's just a single concurrent left sibling page split, the new approach to scanning backwards will at least break even relative to the old one (we'll acquire the same number of leaf page locks as before). The optimization from this commit has long been contemplated by comments added by commit 2ed5b87f96, which changed the rules for locking/pinning during nbtree index scans. The approach that that commit introduced to leaf level link traversal when scanning forwards is now more or less applied all the time, regardless of the direction we're scanning in. Following uniform conventions around sibling link traversal is simpler. The only real remaining difference between our forward and backwards handling is that our backwards handling must still detect and recover from any concurrent left sibling splits (and concurrent page deletions), as documented in the nbtree README. That is structured as a single, isolated extra step that takes place in _bt_readnextpage. Also use this opportunity to further simplify the functions that deal with reading pages and traversing sibling links on the leaf level, and to document their preconditions and postconditions (with respect to things like buffer locks, buffer pins, and seizing the parallel scan). This enhancement completely supersedes the one recently added by commit 3f44959f. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAEze2WgpBGRgTTxTWVPXc9+PB6fc1a7t+VyGXHzfnrFXcQVxnA@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkBTuFv7W2+84jJT8mWZLXVL0GHq2hMUTn6c9Vw=eYrCw@mail.gmail.com
* Remove unused code for unlogged materialized views.Fujii Masao2024-10-18
| | | | | | | | | | | | | | | | | Commit 3bf3ab8c56 initially introduced support for unlogged materialized views, but this was later disallowed by commit 3223b25ff7. Additionally, commit d25f519107 added more code for handling unlogged materialized views. This commit cleans up all unused code related to them. If unlogged materialized views had been supported in any official release, psql would need to retain code to handle them for compatibility with older servers. However, since they were never included in an official release, this code is no longer necessary. Author: Pixian Shi Reviewed-by: Yugo Nagata, Fujii Masao Discussion: https://postgr.es/m/CAAccyYKRZ=OvAvgowiSH+OELbStLP=p2Ht=R3CgT=OaNSH5DAA@mail.gmail.com
* nbtree: fix read page recheck typo.Peter Geoghegan2024-10-16
| | | | Oversight in commit 79fa7b3b.
* Normalize nbtree truncated high key array behavior.Peter Geoghegan2024-10-16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5bf748b8 taught nbtree ScalarArrayOp index scans to decide when and how to start the next primitive index scan based on physical index characteristics. This included rules for deciding whether to start a new primitive index scan (or whether to move onto the right sibling leaf page instead) that specifically consider truncated lower-order columns (-inf columns) from leaf page high keys. These omitted columns were treated as satisfying the scan's required scan keys, though only for scan keys marked required in the current scan direction (forward). Scan keys that didn't get this behavior (those marked required in the backwards direction only) usually didn't give the scan reasonable cause to reposition itself to a later leaf page (via another descent of the index in _bt_first), but _bt_advance_array_keys would nevertheless always give up by forcing another call to _bt_first. _bt_advance_array_keys was unwilling to allow the scan to continue onto the next leaf page, to reconsider whether we really should start another primitive scan based on the details of the sibling page's tuples. This didn't match its behavior with similar cases involving keys required in the current scan direction (forward), which seems unprincipled. It led to an excessive number of primitive scans/index descents for queries with a higher-order = array scan key (with dense, contiguous values) mixed with a lower-order required > or >= scan key. Bring > and >= strategy scan keys in line with other required scan key types: treat truncated -inf scan keys as having satisfied scan keys required in either scan direction (forwards and backwards alike) during array advancement. That way affected scans can continue to the right sibling leaf page. Advancement must now schedule an explicit recheck of the right sibling page's high key in cases involving > or >= scan keys. The recheck gives the scan a way to back out and start another primitive index scan (we can't just rely on _bt_checkkeys with > or >= scan keys). This work can be considered a stand alone optimization on top of the work from commit 5bf748b8. But it was written in preparation for an upcoming patch that will add skip scan to nbtree. In practice scans that use "skip arrays" will tend to be much more sensitive to any implementation deficiencies in this area. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
* Fix fetching default toast value during decoding of in-progress transactions.Amit Kapila2024-10-07
| | | | | | | | | | | | | | | During logical decoding of in-progress transactions, we perform the toast table scan while fetching the default toast value for an attribute. We forgot to initialize the flag during this scan to indicate that the system table scan is in progress. We need this flag to ensure that during logical decoding we never directly access the tableam or heap APIs because we check for concurrent aborts only in systable_* APIs. Reported-by: Alexander Lakhin Author: Takeshi Ideriha, Hou Zhijie Reviewed-by: Amit Kapila, Hou Zhijie Backpatch-through: 14 Discussion: https://postgr.es/m/18641-6687273b7f15269d@postgresql.org
* Rename PageData to GenericXLogPageDataPeter Eisentraut2024-10-04
| | | | | | | | | | | | In the PostgreSQL C type naming schema, the type PageData should be what the pointer of type Page points to. But in this case it's actually an unrelated type local to generic_xlog.c. Rename that to a more specific name. This makes room to possible add a PageData type with the mentioned meaning, but this is not done here. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://www.postgresql.org/message-id/flat/001d457e-c118-4219-8132-e1846c2ae3c9%40eisentraut.org
* Fix inconsistent reporting of checkpointer stats.Fujii Masao2024-10-02
| | | | | | | | | | | | | | | | | | | | | | | Previously, the pg_stat_checkpointer view and the checkpoint completion log message could show different numbers for buffers written during checkpoints. The view only counted shared buffers, while the log message included both shared and SLRU buffers, causing inconsistencies. This commit resolves the issue by updating both the view and the log message to separately report shared and SLRU buffers written during checkpoints. A new slru_written column is added to the pg_stat_checkpointer view to track SLRU buffers, while the existing buffers_written column now tracks only shared buffers. This change would help users distinguish between the two types of buffers, in the pg_stat_checkpointer view and the checkpoint complete log message, respectively. Bump catalog version. Author: Nitin Jadhav Reviewed-by: Bharath Rupireddy, Michael Paquier, Kyotaro Horiguchi, Robert Haas Reviewed-by: Andres Freund, vignesh C, Fujii Masao Discussion: https://postgr.es/m/CAMm1aWb18EpT0whJrjG+-nyhNouXET6ZUw0pNYYAe+NezpvsAA@mail.gmail.com
* Fix race condition in COMMIT PREPARED causing orphaned 2PC filesMichael Paquier2024-10-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | COMMIT PREPARED removes on-disk 2PC files near its end, but the state checked if a file is on-disk or not gets read from shared memory while not holding the two-phase state lock. Because of that, there was a small window where a second backend doing a PREPARE TRANSACTION could reuse the GlobalTransaction put back into the 2PC free list by the COMMIT PREPARED, overwriting the "ondisk" flag read afterwards by the COMMIT PREPARED to decide if its on-disk two-phase state file should be removed, preventing the file deletion. This commit fixes this issue so as the "ondisk" flag in the GlobalTransaction is read while holding the two-phase state lock, not from shared memory after its entry has been added to the free list. Orphaned two-phase state files flushed to disk after a checkpoint are discarded at the beginning of recovery. However, a truncation of pg_xact/ would make the startup process issue a FATAL when it cannot read the SLRU page holding the state of the transaction whose 2PC file was orphaned, which is a necessary step to decide if the 2PC file should be removed or not. Removing manually the file would be necessary in this case. Issue introduced by effe7d9552dd, so backpatch all the way down. Mea culpa. Author: wuchengwen Discussion: https://postgr.es/m/tencent_A7F059B5136A359625C7B2E4A386B3C3F007@qq.com Backpatch-through: 12
* Add num_done counter to the pg_stat_checkpointer view.Fujii Masao2024-09-30
| | | | | | | | | | | | | | | Checkpoints can be skipped when the server is idle. The existing num_timed and num_requested counters in pg_stat_checkpointer track both completed and skipped checkpoints, but there was no way to count only the completed ones. This commit introduces the num_done counter, which tracks only completed checkpoints, making it easier to see how many were actually performed. Bump catalog version. Author: Anton A. Melnikov Reviewed-by: Fujii Masao Discussion: https://postgr.es/m/9ea77f40-818d-4841-9dee-158ac8f6e690@oss.nttdata.com
* Set query ID in parallel workers for vacuum, BRIN and btreeMichael Paquier2024-09-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | All these code paths use their own entry point when starting parallel workers, but failed to set a query ID, even if they set a text query. Hence, this data would be missed in pg_stat_activity for the worker processes. The main entry point for parallel query processing, ParallelQueryMain(), is already doing that by saving its query ID in a dummy PlannedStmt, but not the others. The code is changed so as the query ID of these queries is set in their shared state, and reported back once the parallel workers start. Some tests are added to show how the failures can happen for btree and BRIN with a parallel build enforced, which are able to trigger a failure in an assertion added by 24f520594809 in the recovery TAP test 027_stream_regress.pl where pg_stat_statements is always loaded. In this case, the executor path was taken because the index expression needs to be flattened when building its IndexInfo. Alexander Lakhin has noticed the problem in btree, and I have noticed that the issue was more spread. This is arguably a bug, but nobody has complained about that until now, so no backpatch is done out of caution. If folks would like to see a backpatch, well, let me know. Reported-by: Alexander Lakhin Reviewed-by: Sami Imseih Discussion: https://postgr.es/m/cf3547c1-498a-6a61-7b01-819f902a251f@gmail.com
* For inplace update durability, make heap_update() callers wait.Noah Misch2024-09-24
| | | | | | | | | | | | | | | | | | | The previous commit fixed some ways of losing an inplace update. It remained possible to lose one when a backend working toward a heap_update() copied a tuple into memory just before inplace update of that tuple. In catalogs eligible for inplace update, use LOCKTAG_TUPLE to govern admission to the steps of copying an old tuple, modifying it, and issuing heap_update(). This includes MERGE commands. To avoid changing most of the pg_class DDL, don't require LOCKTAG_TUPLE when holding a relation lock sufficient to exclude inplace updaters. Back-patch to v12 (all supported versions). In v13 and v12, "UPDATE pg_class" or "UPDATE pg_database" can still lose an inplace update. The v14+ UPDATE fix needs commit 86dc90056dfdbd9d1b891718d2e5614e3e432f35, and it wasn't worth reimplementing that fix without such infrastructure. Reviewed by Nitin Motiani and (in earlier versions) Heikki Linnakangas. Discussion: https://postgr.es/m/20231027214946.79.nmisch@google.com
* Fix data loss at inplace update after heap_update().Noah Misch2024-09-24
| | | | | | | | | | | | | | | | | | | | | | As previously-added tests demonstrated, heap_inplace_update() could instead update an unrelated tuple of the same catalog. It could lose the update. Losing relhasindex=t was a source of index corruption. Inplace-updating commands like VACUUM will now wait for heap_update() commands like GRANT TABLE and GRANT DATABASE. That isn't ideal, but a long-running GRANT already hurts VACUUM progress more just by keeping an XID running. The VACUUM will behave like a DELETE or UPDATE waiting for the uncommitted change. For implementation details, start at the systable_inplace_update_begin() header comment and README.tuplock. Back-patch to v12 (all supported versions). In back branches, retain a deprecated heap_inplace_update(), for extensions. Reported by Smolkin Grigory. Reviewed by Nitin Motiani, (in earlier versions) Heikki Linnakangas, and (in earlier versions) Alexander Lakhin. Discussion: https://postgr.es/m/CAMp+ueZQz3yDk7qg42hk6-9gxniYbp-=bG2mgqecErqR5gGGOA@mail.gmail.com
* Update obsolete nbtree array preprocessing comments.Peter Geoghegan2024-09-24
| | | | | | | | | | | | The array->scan_key references fixed up at the end of preprocessing start out as offsets into the arrayKeyData[] array (the array returned by _bt_preprocess_array_keys at the start of preprocessing that involves array scan keys). Offsets into the arrayKeyData[] array are no longer guaranteed to be valid offsets into our original scan->keyData[] input scan key array, but comments describing the array->scan_key references still talked about scan->keyData[]. Update those comments. Oversight in commit b5249741.
* Refactor handling of nbtree array redundancies.Peter Geoghegan2024-09-21
| | | | | | | | | | | | | | | | | | | | | Teach _bt_preprocess_array_keys to eliminate redundant array equality scan keys directly, rather than just marking them as redundant. Its _bt_preprocess_keys caller is no longer required to ignore input scan keys that were marked redundant in this way. Oversights like the one fixed by commit f22e17f7 are no longer possible. The new scheme also makes it easier for _bt_preprocess_keys to output a so.keyData[] scan key array with _more_ scan keys than it was passed in its scan.keyData[] input scan key array. An upcoming patch that adds skip scan optimizations to nbtree will take advantage of this. In passing, remove and rename certain _bt_preprocess_keys variables to make the difference between our input scan key array and our output scan key array clearer. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
* Improve Asserts checking relation matching in parallel scans.Tom Lane2024-09-20
| | | | | | | | | | | | | | | | table_beginscan_parallel and index_beginscan_parallel contain Asserts checking that the relation a worker will use in a parallel scan is the same one the leader intended. However, they were checking for relation OID match, which was not strong enough to detect the mismatch problem fixed in 126ec0bc7. What would be strong enough is to compare relfilenodes instead. Arguably, that's a saner definition anyway, since a scan surely operates on a physical relation not a logical one. Hence, store and compare RelFileLocators not relation OIDs. Also ensure that index_beginscan_parallel checks the index identity not just the table identity. Discussion: https://postgr.es/m/2127254.1726789524@sss.pgh.pa.us
* Fix nbtree pgstats accounting with parallel scans.Peter Geoghegan2024-09-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made parallel index scans work with the new design for arrays via explicit scheduling of primitive index scans. Under this scheme a parallel index scan with array keys will perform the same number of index descents as an equivalent serial index scan (barring corner cases where an individual parallel worker discovers that it can advance the scan's array keys without anybody needing to perform another descent of the index to get to the relevant page on the leaf level). Despite all this, the pgstats accounting wasn't updated; it continued to increment the total number of index scans for the rel once per _bt_first call, no matter the details. As a result, the number of (primitive) index scans could be over-counted during parallel scans. To fix, delay incrementing the count of index scans until after we've established that another descent of the index (using either _bt_search or _bt_endpoint) is required. That way pg_stat_user_tables.idx_scan always advances in the same way, regardless of whether or not the scan makes use of parallelism. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Tomas Vondra <tomas@vondra.me> Discussion: https://postgr.es/m/CAH2-Wz=E7XrkvscBN0U6V81NK3Q-dQOmivvbEsjG-zwEfDdFpg@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com Backpatch: 17-, where nbtree SAOP execution was enhanced.
* Restore relmapper state early enough in parallel workers.Tom Lane2024-09-19
| | | | | | | | | | | | | | | | | | | | | | | We need to do RestoreRelationMap before loading catalog-derived state, else the worker may end up with catalog relcache entries containing stale relfilenode data. Move up RestoreReindexState too, on the principle that that should also happen before we do much of any catalog access. I think ideally these things would happen even before InitPostgres, but there are various problems standing in the way of that, notably that the relmapper thinks "active" mappings should be discarded at transaction end. The implication of this is that InitPostgres and RestoreLibraryState will see the same catalog state as an independent backend would see, which is probably fine; at least, it's been like that all along. Per report from Justin Pryzby. There is a case to be made that this should be back-patched. But given the lack of complaints before 6e086fa2e and the short amount of time remaining before 17.0 wraps, I'll just put it in HEAD for now. Discussion: https://postgr.es/m/ZuoU_8EbSTE14o1U@pryzbyj2023
* Move pg_wal_replay_wait() to xlogfuncs.cAlexander Korotkov2024-09-19
| | | | | | | | | | | This commit moves pg_wal_replay_wait() procedure to be a neighbor of WAL-related functions in xlogfuncs.c. The implementation of LSN waiting continues to reside in the same place. By proposal from Michael Paquier. Reported-by: Peter Eisentraut Discussion: https://postgr.es/m/18c0fa64-0475-415e-a1bd-665d922c5201%40eisentraut.org
* Extend PgStat_HashKey.objid from 4 to 8 bytesMichael Paquier2024-09-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This opens the possibility to define keys for more types of statistics kinds in PgStat_HashKey, the first case being 8-byte query IDs for statistics like pg_stat_statements. This increases the size of PgStat_HashKey from 12 to 16 bytes, while PgStatShared_HashEntry, entry stored in the dshash for pgstats, keeps the same size due to alignment. xl_xact_stats_item, that tracks the stats items to drop in commit WAL records, is increased from 12 to 16 bytes. Note that individual chunks in commit WAL records should be multiples of sizeof(int), hence 8-byte object IDs are stored as two uint32, based on a suggestion from Heikki Linnakangas. While on it, the field of PgStat_HashKey is renamed from "objoid" to "objid", as for some stats kinds this field does not refer to OIDs but just IDs, like for replication slot stats. This commit bumps the following format variables: - PGSTAT_FILE_FORMAT_ID, as PgStat_HashKey is written to the stats file for non-serialized stats kinds in the dshash table. - XLOG_PAGE_MAGIC for the changes in xl_xact_stats_item. - Catalog version, for the SQL function pg_stat_have_stats(). Reviewed-by: Bertrand Drouvot Discussion: https://postgr.es/m/ZsvTS9EW79Up8I62@paquier.xyz
* Minor cleanup related to pg_wal_replay_wait() procedureAlexander Korotkov2024-09-17
| | | | | | | | | | | | * Rename $node_standby1 to $node_standby in 043_wal_replay_wait.pl as there is only one standby. * Remove useless debug printing in 043_wal_replay_wait.pl. * Fix typo in one check description in 043_wal_replay_wait.pl. * Fix some wording in comments and documentation. Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/1d7b08f2-64a2-77fb-c666-c9a74c68eeda%40gmail.com Reviewed-by: Alexander Lakhin
* Avoid parallel nbtree index scan hangs with SAOPs.Peter Geoghegan2024-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution, made parallel index scans work with the new design for arrays via explicit scheduling of primitive index scans. A backend that successfully scheduled the scan's next primitive index scan saved its backend local array keys in shared memory. Any backend could pick up the scheduled primitive scan within _bt_first. This scheme decouples scheduling a primitive scan from starting the scan (by performing another descent of the index via a _bt_search call from _bt_first) to make things robust. The scheme had a deadlock hazard, at least when the leader process participated in the scan. _bt_parallel_seize had a code path that made backends that were not in an immediate position to start a scheduled primitive index scan wait for some other backend to do so instead. Under the right circumstances, the leader process could wait here forever: the leader would wait for any other backend to start the primitive scan, while every worker was busy waiting on the leader to consume tuples from the scan's tuple queue. To fix, don't wait for a scheduled primitive index scan to be started by some other eligible backend from within _bt_parallel_seize (when the calling backend isn't in a position to do so itself). Return false instead, while recording that the scan has a scheduled primitive index scan in backend local state. This leaves the backend in the same state as the existing case where a backend schedules (or tries to schedule) another primitive index scan from within _bt_advance_array_keys, before calling _bt_parallel_seize. _bt_parallel_seize already handles that case by returning false without waiting, and without unsetting the backend local state. Leaving the backend in this state enables it to start a previously scheduled primitive index scan once it gets back to _bt_first. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Matthias van de Meent, with tweaks by me. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reported-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com Backpatch: 17-, where nbtree SAOP execution was enhanced.
* Add temporal PRIMARY KEY and UNIQUE constraintsPeter Eisentraut2024-09-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add WITHOUT OVERLAPS clause to PRIMARY KEY and UNIQUE constraints. These are backed by GiST indexes instead of B-tree indexes, since they are essentially exclusion constraints with = for the scalar parts of the key and && for the temporal part. (previously committed as 46a0cd4cefb, reverted by 46a0cd4cefb; the new part is this:) Because 'empty' && 'empty' is false, the temporal PK/UQ constraint allowed duplicates, which is confusing to users and breaks internal expectations. For instance, when GROUP BY checks functional dependencies on the PK, it allows selecting other columns from the table, but in the presence of duplicate keys you could get the value from any of their rows. So we need to forbid empties. This all means that at the moment we can only support ranges and multiranges for temporal PK/UQs, unlike the original patch (above). Documentation and tests for this are added. But this could conceivably be extended by introducing some more general support for the notion of "empty" for other types. Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com
* Add stratnum GiST support functionPeter Eisentraut2024-09-17
| | | | | | | | | | | | | | | | | | | | | This is support function 12 for the GiST AM and translates "well-known" RT*StrategyNumber values into whatever strategy number is used by the opclass (since no particular numbers are actually required). We will use this to support temporal PRIMARY KEY/UNIQUE/FOREIGN KEY/FOR PORTION OF functionality. This commit adds two implementations, one for internal GiST opclasses (just an identity function) and another for btree_gist opclasses. It updates btree_gist from 1.7 to 1.8, adding the support function for all its opclasses. (previously committed as 6db4598fcb8, reverted by 8aee330af55; this is essentially unchanged from those) Author: Paul A. Jungwirth <pj@illuminatedcomputing.com> Reviewed-by: Peter Eisentraut <peter@eisentraut.org> Reviewed-by: jian he <jian.universality@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/CA+renyUApHgSZF9-nd-a0+OPGharLQLO=mDHcY4_qQ0+noCUVg@mail.gmail.com