path: root/src/backend/access
* Improve quoting in some error messages (Peter Eisentraut, 2021-04-14)
* Don't truncate heap when VACUUM's failsafe is in effect. (Peter Geoghegan, 2021-04-13)

It seems like a good idea to bypass heap truncation when the wraparound failsafe mechanism (which was added in commit 1e55e7d1) is in effect.

Deliberately don't bypass heap truncation in the INDEX_CLEANUP=off case, even though it is similar to the failsafe case. There is already a separate reloption (and related VACUUM parameter) for that.

Reported-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAD21AoDWRh6oTN5T8wa+cpZUVpHXET8BJ8Da7WHVHpwkPP6KLg@mail.gmail.com
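The INDEX_CLEANUP control mentioned above is already user-visible; a minimal sketch of both forms (the table name is a placeholder):

    -- One-off, for a single VACUUM run:
    VACUUM (INDEX_CLEANUP OFF) my_table;
    -- Persistently, via the corresponding reloption:
    ALTER TABLE my_table SET (vacuum_index_cleanup = off);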
* Avoid improbable PANIC during heap_update. (Tom Lane, 2021-04-13)

heap_update needs to clear any existing "all visible" flag on the old tuple's page (and on the new page too, if different). Per coding rules, to do this it must acquire pin on the appropriate visibility-map page while not holding exclusive buffer lock, which creates a race condition since someone else could set the flag whenever we're not holding the buffer lock. The code is supposed to handle that by re-checking the flag after acquiring buffer lock and retrying if it became set. However, one code path through heap_update itself, as well as one in its subroutine RelationGetBufferForTuple, failed to do this.

The end result, in the unlikely event that a concurrent VACUUM did set the flag while we're transiently not holding lock, is a non-recurring "PANIC: wrong buffer passed to visibilitymap_clear" failure.

This has been seen a few times in the buildfarm since recent VACUUM changes that added code paths that could set the all-visible flag while holding only exclusive buffer lock. Previously, the flag was (usually?) set only after doing LockBufferForCleanup, which would insist on buffer pin count zero, thus preventing the flag from becoming set partway through heap_update. However, it's clear that it's heap_update not VACUUM that's at fault here.

What's less clear is whether there is any hazard from these bugs in released branches. heap_update is certainly violating API expectations, but if there is no code path that can set all-visible without a cleanup lock then it's only a latent bug. That's not 100% certain though, besides which we should worry about extensions or future back-patch fixes that could introduce such code paths. I chose to back-patch to v12. Fixing RelationGetBufferForTuple before that would require also back-patching portions of older fixes (notably 0d1fe9f74), which is more code churn than seems prudent to fix a hypothetical issue.

Discussion: https://postgr.es/m/2247102.1618008027@sss.pgh.pa.us
* Fix potential SSI hazard in heap_update(). (Thomas Munro, 2021-04-13)

Commit 6f38d4dac38 failed to heed a warning about the stability of the value pointed to by "otid". The caller is allowed to pass in a pointer to newtup->t_self, which will be updated during the execution of the function. Instead, the SSI check should use the value we copy into oldtup.t_self near the top of the function.

Not a live bug, because newtup->t_self doesn't really get updated until a bit later, but it was confusing and broke the rule established by the comment.

Back-patch to 13.

Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/2689164.1618160085%40sss.pgh.pa.us
* Remove COMMIT_TS_SETTS record. (Fujii Masao, 2021-04-12)

Commit 438fc4a39c prevented the WAL replay from writing a COMMIT_TS_SETTS record. With that change, there is no code left in PostgreSQL core that generates a COMMIT_TS_SETTS record. We can also assume that no extensions use the record, since we have received no complaints so far about the issue that commit 438fc4a39c fixed. Therefore this commit removes the COMMIT_TS_SETTS record and its related code. Even without this record, the timestamp required for the commit timestamp feature can be acquired from the COMMIT record.

Bump WAL page magic.

Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
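For context, a sketch of how commit timestamps are consumed once the feature is enabled (the xid value is illustrative):

    -- Must be enabled (and the server restarted) for timestamps to be recorded:
    ALTER SYSTEM SET track_commit_timestamp = on;
    -- Afterwards, the timestamp comes from data recorded at COMMIT:
    SELECT pg_xact_commit_timestamp('1234'::xid);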
* Doc: Review for "Optionally prefetch referenced data in recovery." (Thomas Munro, 2021-04-10)

Typos, corrections and language improvements in the docs, and a few in code comments too.

Reported-by: Justin Pryzby <pryzby@telsasoft.com>
Discussion: https://postgr.es/m/20210409033703.GP6592%40telsasoft.com
* Silence another _bt_check_unique compiler warning. (Peter Geoghegan, 2021-04-08)

Per complaint from Tom Lane

Discussion: https://postgr.es/m/1922884.1617909599@sss.pgh.pa.us
* Optionally prefetch referenced data in recovery. (Thomas Munro, 2021-04-08)

Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Better mechanisms will follow in later work on the I/O subsystem.

The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed.

The GUC wal_decode_buffer_size is used to limit the maximum distance we are prepared to read ahead in the WAL to find uncached blocks.

Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts)
Reviewed-by: Andres Freund <andres@anarazel.de> (parts)
Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts)
Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com>
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Tested-by: Dmitry Dolgov <9erthalion6@gmail.com>
Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
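A sketch of the three settings named above working together (values are illustrative, not recommendations):

    ALTER SYSTEM SET recovery_prefetch = on;            -- disabled by default
    ALTER SYSTEM SET wal_decode_buffer_size = '512kB';  -- lookahead distance in WAL
    ALTER SYSTEM SET maintenance_io_concurrency = 10;   -- cap on concurrent prefetch I/Os
    SELECT pg_reload_conf();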
* Add circular WAL decoding buffer. (Thomas Munro, 2021-04-08)

Teach xlogreader.c to decode its output into a circular buffer, to support optimizations based on looking ahead.

* XLogReadRecord() works as before, consuming records one by one, and allowing them to be examined via the traditional XLogRecGetXXX() macros.

* An alternative new interface XLogNextRecord() is added that returns pointers to DecodedXLogRecord structs that can be examined directly.

* XLogReadAhead() provides a second cursor that lets you see further ahead, as long as data is available and there is enough space in the decoding buffer. This returns DecodedXLogRecord pointers to the caller, but also adds them to a queue of records that will later be consumed by XLogNextRecord()/XLogReadRecord().

The buffer's size is controlled with wal_decode_buffer_size. The buffer could potentially be placed into shared memory, for future projects. Large records that don't fit in the circular buffer are called "oversized" and allocated separately with palloc().

Discussion: https://postgr.es/m/CA+hUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq=AovOddfHpA@mail.gmail.com
* Remove read_page callback from XLogReader. (Thomas Munro, 2021-04-08)

Previously, the XLogReader module would fetch new input data using a callback function. Redesign the interface so that it tells the caller to insert more data with a special return value instead. This API suits later patches for prefetching, encryption and maybe other future projects that would otherwise require continually extending the callback interface.

As incidental cleanup work, move the global variables readOff, readLen and readSegNo inside XLogReaderState.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Author: Heikki Linnakangas <hlinnaka@iki.fi> (parts of earlier version)
Reviewed-by: Antonin Houska <ah@cybertec.at>
Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com>
Reviewed-by: Takashi Menjo <takashi.menjo@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/20190418.210257.43726183.horiguchi.kyotaro%40lab.ntt.co.jp
* autovacuum: handle analyze for partitioned tables (Alvaro Herrera, 2021-04-08)

Previously, autovacuum would completely ignore partitioned tables, which is not good regarding analyze -- failing to analyze those tables means poor plans may be chosen. Make autovacuum aware of those tables by propagating "changes since analyze" counts from the leaf partitions up the partitioning hierarchy.

This also introduces necessary reloptions support for partitioned tables (autovacuum_enabled, autovacuum_analyze_scale_factor, autovacuum_analyze_threshold). It's unclear how best to document this aspect.

Author: Yuzuko Hosoya <yuzukohosoya@gmail.com>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: Tomas Vondra <tomas.vondra@enterprisedb.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: https://postgr.es/m/CAKkQ508_PwVgwJyBY=0Lmkz90j8CmWNPUxgHvCUwGhMrouz6UA@mail.gmail.com
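A sketch of the new reloptions applied to a partitioned table (option names as listed above; the table name and values are placeholders):

    ALTER TABLE parted_tbl SET (
        autovacuum_enabled = true,
        autovacuum_analyze_scale_factor = 0.05,
        autovacuum_analyze_threshold = 500
    );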
* Teach VACUUM to bypass unnecessary index vacuuming. (Peter Geoghegan, 2021-04-07)

VACUUM has never needed to call ambulkdelete() for each index in cases where there are precisely zero TIDs in its dead_tuples array by the end of its first pass over the heap (also its only pass over the heap in this scenario). Index vacuuming is simply not required when this happens. Index cleanup will still go ahead, but in practice most calls to amvacuumcleanup() are usually no-ops when there were zero preceding ambulkdelete() calls. In short, VACUUM has generally managed to avoid index scans when there were clearly no index tuples to delete from indexes. But cases with _close to_ no index tuples to delete were another matter -- a round of ambulkdelete() calls took place (one per index), each of which performed a full index scan.

VACUUM now behaves just as if there were zero index tuples to delete in cases where there are in fact "virtually zero" such tuples. That is, it can now bypass index vacuuming and heap vacuuming as an optimization (though not index cleanup). Whether or not VACUUM bypasses indexes is determined dynamically, based on the just-observed number of heap pages in the table that have one or more LP_DEAD items (LP_DEAD items in heap pages have a 1:1 correspondence with index tuples that still need to be deleted from each index in the worst case).

We only skip index vacuuming when 2% or less of the table's pages have one or more LP_DEAD items -- bypassing index vacuuming as an optimization must not noticeably impede setting bits in the visibility map. As a further condition, the dead_tuples array (i.e. VACUUM's array of LP_DEAD item TIDs) must not exceed 32MB at the point that the first pass over the heap finishes, which is also when the decision to bypass is made. (The VACUUM must also have been able to fit all TIDs in its maintenance_work_mem-bound dead_tuples space, though with a default maintenance_work_mem setting it can't matter.)

This avoids surprising jumps in the duration and overhead of routine vacuuming with workloads where successive VACUUM operations consistently have almost zero dead index tuples. The number of LP_DEAD items may well accumulate over multiple VACUUM operations, before finally the threshold is crossed and VACUUM performs conventional index vacuuming. Even then, the optimization will have avoided a great deal of largely unnecessary index vacuuming.

In the future we may teach VACUUM to skip index vacuuming on a per-index basis, using a much more sophisticated approach. For now we only consider the extreme cases, where we can be quite confident that index vacuuming just isn't worth it using simple heuristics.

Also log information about how many heap pages have one or more LP_DEAD items when autovacuum logging is enabled.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmkebqPd4MVGuPTOS9bMFvp9MDs5cRTCOsv1rQJ3jCbXw@mail.gmail.com
* Add wraparound failsafe to VACUUM. (Peter Geoghegan, 2021-04-07)

Add a failsafe mechanism that is triggered by VACUUM when it notices that the table's relfrozenxid and/or relminmxid are dangerously far in the past. VACUUM checks the age of the table dynamically, at regular intervals.

When the failsafe triggers, VACUUM takes extraordinary measures to finish as quickly as possible so that relfrozenxid and/or relminmxid can be advanced. VACUUM will stop applying any cost-based delay that may be in effect. VACUUM will also bypass any further index vacuuming and heap vacuuming -- it only completes whatever remaining pruning and freezing is required. Bypassing index/heap vacuuming is enabled by commit 8523492d, which made it possible to dynamically trigger the mechanism already used within VACUUM when it is run with INDEX_CLEANUP off.

It is expected that the failsafe will almost always trigger within an autovacuum to prevent wraparound, long after the autovacuum began. However, the failsafe mechanism can trigger in any VACUUM operation. Even in a non-aggressive VACUUM, where we're likely to not advance relfrozenxid, it still seems like a good idea to finish off remaining pruning and freezing. An aggressive/anti-wraparound VACUUM will be launched immediately afterwards. Note that the anti-wraparound VACUUM that follows will itself trigger the failsafe, usually before it even begins its first (and only) pass over the heap.

The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and vacuum_multixact_failsafe_age. There are no equivalent reloptions, since that isn't expected to be useful. The GUCs have rather high defaults (both default to 1.6 billion), and are expected to generally only be used to make the failsafe trigger sooner/more frequently.

Author: Masahiko Sawada <sawada.mshk@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WzmgH3ySGYeC-m-eOBsa2=sDwa292-CFghV4rESYo39FsQ@mail.gmail.com
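A sketch of adjusting the new GUCs (1.6 billion is the default per the text above; lower settings make the failsafe trigger sooner):

    ALTER SYSTEM SET vacuum_failsafe_age = 1200000000;
    ALTER SYSTEM SET vacuum_multixact_failsafe_age = 1200000000;
    SELECT pg_reload_conf();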
* Truncate line pointer array during VACUUM. (Peter Geoghegan, 2021-04-07)

Teach VACUUM to truncate the line pointer array of each heap page when a contiguous group of LP_UNUSED line pointers appear at the end of the array -- these unused and unreferenced items are excluded. This process occurs during VACUUM's second pass over the heap, right after LP_DEAD line pointers on the page (those encountered/pruned during the first pass) are marked LP_UNUSED.

Truncation avoids line pointer bloat with certain workloads, particularly those involving continual range DELETEs and bulk INSERTs against the same table.

Also harden heapam code to check for an out-of-range page offset number in places where we weren't already doing so.

Author: Matthias van de Meent <boekewurm+postgres@gmail.com>
Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wzn6a64PJM1Ggzm=uvx2otsopJMhFQj_g1rAj4GWr3ZSzw@mail.gmail.com
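A sketch of the kind of workload that produces the trailing LP_UNUSED items this targets (schema and values are hypothetical):

    CREATE TABLE events (id bigserial PRIMARY KEY, payload text);
    INSERT INTO events (payload) SELECT 'x' FROM generate_series(1, 100000);
    -- Continual range DELETEs followed by bulk INSERTs bloat the line
    -- pointer array; VACUUM can now truncate the unused trailing entries:
    DELETE FROM events WHERE id <= 50000;
    VACUUM events;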
* Don't add non-existent pages to bitmap from BRIN (Tomas Vondra, 2021-04-07)

The code in bringetbitmap() simply added the whole matching page range to the TID bitmap, as determined by pages_per_range, even if some of the pages were beyond the end of the heap. The query then might fail with an error like this:

    ERROR: could not open file "base/20176/20228.2" (target block 262144): previous segment is only 131021 blocks

In this case, the relation has 262093 pages (131072 and 131021 pages), but we're trying to access block 262144, i.e. the first block of the 3rd segment. At that point _mdfd_getseg() notices the preceding segment is incomplete, and fails.

Hitting this in practice is rather unlikely, because:

* Most indexes use power-of-two ranges, so segments and page ranges align perfectly (segment end is also a page range end).

* The table size has to be just right, with the last segment being almost full - less than one page range from a full segment, so that the last page range actually crosses the segment boundary.

* Prefetch has to be enabled. The regular page access checks that pages are not beyond heap end, but prefetch does not. On older releases (before 12) the execution stops after hitting the first non-existent page, so the prefetch distance has to be sufficient to reach the first page in the next segment to trigger the issue. Since 12 it's enough to just have prefetch enabled; the prefetch distance does not matter.

Fixed by not adding non-existent pages to the TID bitmap. Backpatch all the way back to 9.6 (BRIN indexes were introduced in 9.5, but that release is EOL).

Backpatch-through: 9.6
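For reference, the page range size is a per-index reloption; a sketch (the non-power-of-two value is chosen only to show how page ranges can stop aligning with 1GB segment boundaries):

    CREATE TABLE t (a int);
    CREATE INDEX ON t USING brin (a) WITH (pages_per_range = 96);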
* Revert "Add sortsupport for gist_btree opclasses, for faster index builds."Heikki Linnakangas2021-04-07
| | | | | | | | | | This reverts commit 9f984ba6d23dc6eecebf479ab1d3f2e550a4e9be. It was making the buildfarm unhappy, apparently setting client_min_messages in a regression test produces different output if log_statement='all'. Another issue is that I now suspect the bit sortsupport function was in fact not correct to call byteacmp(). Revert to investigate both of those issues.
* Add sortsupport for gist_btree opclasses, for faster index builds. (Heikki Linnakangas, 2021-04-07)

Commit 16fa9b2b30 introduced a faster way to build GiST indexes, by sorting all the data. This commit adds the sortsupport functions needed to make use of that feature for btree_gist.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/2F3F7265-0D22-44DB-AD71-8554C743D943@yandex-team.ru
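A sketch of the feature in use (btree_gist is the contrib module in question; note the revert directly above, pending investigation):

    CREATE EXTENSION btree_gist;
    CREATE TABLE t (a int4);
    -- With sortsupport in place, this index build can use the faster
    -- sorted build strategy added by commit 16fa9b2b30:
    CREATE INDEX ON t USING gist (a);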
* Remove redundant memset(0) calls for page init of some index AMs (Michael Paquier, 2021-04-07)

Bloom, GIN, GiST and SP-GiST rely on PageInit() to initialize the contents of a page, and this routine fills the entire page with zeros for a size of BLCKSZ, including the special space. Those index AMs have been using an extra memset() call to fill the special page space, or even the whole page, with zeros, which is not necessary as PageInit() already does this work, so let's remove them. GiST was not making this extra call, but has had such a call commented out since 6236991. While on it, remove one MAXALIGN() for SP-GiST as PageInit() takes care of that. This makes the whole page initialization logic more consistent across all index AMs.

Author: Bharath Rupireddy
Reviewed-by: Vignesh C, Mahendra Singh Thalor
Discussion: https://postgr.es/m/CALj2ACViOo2qyaPT7krWm4LRyRTw9kOXt+g6PfNmYuGA=YHj9A@mail.gmail.com
* Remove tupgone special case from vacuumlazy.c. (Peter Geoghegan, 2021-04-06)

Retry the call to heap_prune_page() in rare cases where there is disagreement between the heap_prune_page() call and the call to HeapTupleSatisfiesVacuum() that immediately follows. Disagreement is possible when a concurrently-aborted transaction makes a tuple DEAD during the tiny window between each step. This was the only case where a tuple considered DEAD by VACUUM still had storage following pruning. VACUUM's definition of dead tuples is now uniformly simple and unambiguous: dead tuples from each page are always LP_DEAD line pointers that were encountered just after we performed pruning (and just before we considered freezing remaining items with tuple storage).

Eliminating the tupgone=true special case enables INDEX_CLEANUP=off style skipping of index vacuuming that takes place based on flexible, dynamic criteria. The INDEX_CLEANUP=off case had to know about skipping indexes up-front before now, due to a subtle interaction with the special case (see commit dd695979) -- this was a special case unto itself. Now there are no special cases. And so now it won't matter when or how we decide to skip index vacuuming: it won't affect how pruning behaves, and it won't be affected by any of the implementation details of pruning or freezing.

Also remove XLOG_HEAP2_CLEANUP_INFO records. These are no longer necessary because we now rely entirely on heap pruning taking care of recovery conflicts. There is no longer any need to generate recovery conflicts for DEAD tuples that pruning just missed. This also means that heap vacuuming now uses exactly the same strategy for recovery conflicts as index vacuuming always has: REDO routines never need to process a latestRemovedXid from the WAL record, since earlier REDO of the WAL record from pruning is sufficient in all cases. The generic XLOG_HEAP2_CLEAN record type is now split into two new record types to reflect this new division (these are called XLOG_HEAP2_PRUNE and XLOG_HEAP2_VACUUM).

Also stop acquiring a super-exclusive lock for heap pages when they're vacuumed during VACUUM's second heap pass. A regular exclusive lock is enough. This is correct because heap page vacuuming is now strictly a matter of setting the LP_DEAD line pointers to LP_UNUSED. No other backend can have a pointer to a tuple located in a pinned buffer that can be invalidated by a concurrent heap page vacuum operation.

Heap vacuuming can now be thought of as conceptually similar to index vacuuming and conceptually dissimilar to heap pruning. Heap pruning now has sole responsibility for anything involving the logical contents of the database (e.g., managing transaction status information, recovery conflicts, considering what to do with HOT chains). Index vacuuming and heap vacuuming are now only concerned with recycling garbage items from physical data structures that back the logical database.

Bump XLOG_PAGE_MAGIC due to pruning and heap page vacuum WAL record changes.

Credit for the idea of retrying pruning a page to avoid the tupgone case goes to Andres Freund.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Andres Freund <andres@anarazel.de>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
* Refactor lazy_scan_heap() loop. (Peter Geoghegan, 2021-04-06)

Add a lazy_scan_heap() subsidiary function that handles heap pruning and tuple freezing: lazy_scan_prune(). This is a great deal cleaner. The code that remains in lazy_scan_heap()'s per-block loop can now be thought of as code that either comes before or after the call to lazy_scan_prune(), which is now the clear focal point. This division is enforced by the way in which we now manage state. lazy_scan_prune() outputs state (using its own struct) that describes what to do with the page following pruning and freezing (e.g., visibility map maintenance, recording free space in the FSM). It doesn't get passed any special instructional state from the preamble code, though.

Also cleanly separate the logic used by a VACUUM with INDEX_CLEANUP=off from the logic used by single-heap-pass VACUUMs. The former case is now structured as the omission of index and heap vacuuming by a two pass VACUUM. The latter case goes back to being used only when the table happens to have no indexes (just as it was before commit a96c41fe). This structure is much more natural, since the whole point of INDEX_CLEANUP=off is to skip the index and heap vacuuming that would otherwise take place. The single-heap-pass case doesn't skip any useful work, though -- it just does heap pruning and heap vacuuming together when the table happens to have no indexes.

Both of these changes are preparation for an upcoming patch that generalizes the mechanism used by INDEX_CLEANUP=off. The later patch will allow VACUUM to give up on index and heap vacuuming dynamically, as problems emerge (e.g., with wraparound), so that an affected VACUUM operation can finish up as soon as possible.

Also fix a very old bug in single-pass VACUUM VERBOSE output. We were reporting the number of tuples deleted via pruning as a direct substitute for reporting the number of LP_DEAD items removed in a function that deals with the second pass over the heap. But that doesn't work at all -- they're two different things.

To fix, start tracking the total number of LP_DEAD items encountered during pruning, and use that in the report instead. A single pass VACUUM will always vacuum away whatever LP_DEAD items a heap page has immediately after it is pruned, so the total number of LP_DEAD items encountered during pruning equals the total number vacuumed-away. (They are _not_ equal in the INDEX_CLEANUP=off case, but that's okay because skipping index vacuuming is now a totally orthogonal concept to one-pass VACUUM.)

Also stop reporting the count of LP_UNUSED items in VACUUM VERBOSE output. This makes the output of VACUUM VERBOSE more consistent with log_autovacuum's output (because it never showed information about LP_UNUSED items). VACUUM VERBOSE reported LP_UNUSED items left behind by the last VACUUM, and LP_UNUSED items created via pruning HOT chains during the current VACUUM (it never included LP_UNUSED items left behind by the current VACUUM's second pass over the heap). This makes it useless as an indicator of line pointer bloat, which must have been the original intention. (Like the first VACUUM VERBOSE issue, this issue was arguably an oversight in commit 282d2a03, which added the heap-only tuple optimization.)

Finally, stop reporting empty_pages in VACUUM VERBOSE output, and start reporting pages_removed instead. This also makes the output of VACUUM VERBOSE more consistent with log_autovacuum's output (which does not show empty_pages, but does show pages_removed). An empty page isn't meaningfully different to a page that is almost empty, or a page that is empty but for only a small number of remaining LP_UNUSED items.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-WznneCXTzuFmcwx_EyRQgfsfJAAsu+CsqRFmFXCAar=nJw@mail.gmail.com
* Clean up treatment of missing default and CHECK-constraint records. (Tom Lane, 2021-04-06)

Andrew Gierth reported that it's possible to crash the backend if no pg_attrdef record is found to match an attribute that has atthasdef set. AttrDefaultFetch warns about this situation, but then leaves behind a relation tupdesc that has null "adbin" pointer(s), which most places don't guard against.

We considered promoting the warning to an error, but throwing errors during relcache load is pretty drastic: it effectively locks one out of using the relation at all. What seems better is to leave the load-time behavior as a warning, but then throw an error in any code path that wants to use a default and can't find it. This confines the error to a subset of INSERT/UPDATE operations on the table, and in particular will at least allow a pg_dump to succeed. Also, we should fix AttrDefaultFetch to not leave any null pointers in the tupdesc, because that just creates an untested bug hazard.

While at it, apply the same philosophy of "warn at load, throw error only upon use of the known-missing info" to CHECK constraints. CheckConstraintFetch is very nearly the same logic as AttrDefaultFetch, but for reasons lost in the mists of time, it was throwing ERROR for the same cases that AttrDefaultFetch treats as WARNING. Make the two functions more nearly alike.

In passing, get rid of potentially-O(N^2) loops in equalTupleDesc by making AttrDefaultFetch sort the entries after fetching them, so that equalTupleDesc can assume that entries in two equal tupdescs must be in matching order. (CheckConstraintFetch already was sorting CHECK constraints, but equalTupleDesc hadn't been told about it.)

There's some argument for back-patching this, but with such a small number of field reports, I'm content to fix it in HEAD.

Discussion: https://postgr.es/m/87pmzaq4gx.fsf@news-spur.riddles.org.uk
* Stop archive recovery if WAL generated with wal_level=minimal is found. (Fujii Masao, 2021-04-06)

Previously, if hot standby was enabled, archive recovery exited with an error when it found WAL generated with wal_level=minimal. But if hot standby was disabled, it just reported a warning and continued, which could lead to data loss or errors during normal operation. A warning was emitted, but users could easily miss it and not notice this serious situation until they encountered the actual errors.

To improve this situation, this commit changes archive recovery so that it exits with a FATAL error when it finds WAL generated with wal_level=minimal, whatever the setting of hot standby. This enables users to notice the serious situation soon.

The FATAL error is thrown if archive recovery starts from a base backup taken before wal_level was changed to minimal. When archive recovery exits with the error, if users have a base backup taken after setting wal_level to higher than minimal, they can recover the database by starting archive recovery from that newer backup. But note that if such a backup doesn't exist, there is no easy way to complete archive recovery, which may make the database server unstartable and users may lose the whole database. The commit adds a note about this risk into the document.

Even in the case of an unstartable database server, previously users could avoid the error during archive recovery by just disabling hot standby, forcibly start up the server, and salvage data from it. But note that this commit makes this procedure entirely unavailable.

Author: Takamichi Osumi
Reviewed-by: Laurenz Albe, Kyotaro Horiguchi, David Steele, Fujii Masao
Discussion: https://postgr.es/m/OSBPR01MB4888CBE1DA08818FD2D90ED8EDF90@OSBPR01MB4888.jpnprd01.prod.outlook.com
* Allocate access strategy in parallel VACUUM workers. (Peter Geoghegan, 2021-04-05)

Commit 49f49def took entirely the wrong approach to fixing this issue. Just allocate a local buffer access strategy in each individual worker instead of trying to propagate state. This state was never propagated by parallel VACUUM in the first place.

It looks like the only reason that this worked following commit 40d964ec was that it involved static global variables, which are initialized to 0 per the C standard.

A more comprehensive fix may be necessary, even on HEAD. This fix should at least get the buildfarm green once again.

Thanks once again to Thomas Munro for continued off-list assistance with the issue.
* Support INCLUDE'd columns in SP-GiST. (Tom Lane, 2021-04-05)

Not much to say here: does what it says on the tin. We steal a previously-always-zero bit from the nextOffset field of leaf index tuples in order to track whether there is a nulls bitmap. Otherwise it works about like included columns in other index types.

Pavel Borisov, reviewed by Andrey Borodin and Anastasia Lubennikova, and rather heavily editorialized on by me

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
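A sketch of the syntax this enables (table and column names are placeholders):

    CREATE TABLE pts (p point, label text);
    -- The included column rides along in the leaf tuples and can
    -- satisfy index-only scans, as with other index types:
    CREATE INDEX ON pts USING spgist (p) INCLUDE (label);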
* Propagate parallel VACUUM's buffer access strategy. (Peter Geoghegan, 2021-04-05)

Parallel VACUUM relied on global variable state from the leader process being propagated to workers on fork(). Commit b4af70cb removed most uses of global variables inside vacuumlazy.c, but did not account for the buffer access strategy state. To fix, propagate the state through shared memory instead.

Per buildfarm failures on elver, curculio, and morepork.

Many thanks to Thomas Munro for off-list assistance with this issue.
* Simplify state managed by VACUUM. (Peter Geoghegan, 2021-04-05)

Reorganize the state struct used by VACUUM -- group related items together to make it easier to understand. Also stop relying on stack variables inside lazy_scan_heap() -- move those into the state struct instead. Doing things this way simplifies large groups of related functions whose function signatures had a lot of unnecessary redundancy.

Switch over to using int64 for the struct fields used to count things that are reported to the user via log_autovacuum and VACUUM VERBOSE output. We were using double, but that doesn't seem to have any advantages. Using int64 makes it possible to add assertions that verify that the first pass over the heap (pruning) encounters precisely the same number of LP_DEAD items that get deleted from indexes later on, in the second pass over the heap. These assertions will be added in later commits.

Finally, adjust the signatures of functions with IndexBulkDeleteResult pointer arguments in cases where there was ambiguity about whether or not the argument relates to a single index or all indexes. Functions now use the idiom that both ambulkdelete() and amvacuumcleanup() have always used (where appropriate): accept a mutable IndexBulkDeleteResult pointer argument, and return a result IndexBulkDeleteResult pointer to caller.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-By: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CAH2-WzkeOSYwC6KNckbhk2b1aNnWum6Yyn0NKP9D-Hq1LGTDPw@mail.gmail.com
* Fix more confusion in SP-GiST. (Tom Lane, 2021-04-04)

spg_box_quad_leaf_consistent unconditionally returned the leaf datum as leafValue, even though in its usage for poly_ops that value is of completely the wrong type. In versions before 12, that was harmless because the core code did nothing with leafValue in non-index-only scans ... but since commit 2a6368343, if we were doing a KNN-style scan, spgNewHeapItem would unconditionally try to copy the value using the wrong datatype parameters. Said copying is a waste of time and space if we're not going to return the data, but it accidentally failed to fail until I fixed the datatype confusion in ac9099fc1.

Hence, change spgNewHeapItem to not copy the datum unless we're actually going to return it later. This saves cycles and dodges the question of whether lossy opclasses are returning the right type. Also change spg_box_quad_leaf_consistent to not return data that might be of the wrong type, as insurance against somebody introducing a similar bug into the core code in future.

It seems like a good idea to back-patch these two changes into v12 and v13, although I'm afraid to change spgNewHeapItem's mistaken idea of which datatype to use in those branches.

Per buildfarm results from ac9099fc1.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
* Fix confusion in SP-GiST between attribute type and leaf storage type. (Tom Lane, 2021-04-04)

According to the documentation, the attType passed to the opclass config function (and also relied on by the core code) is the type of the heap column or expression being indexed. But what was actually being passed was the type stored for the index column. This made no difference for user-defined SP-GiST opclasses, because we weren't allowing the STORAGE clause of CREATE OPCLASS to be used, so the two types would be the same. But it's silly not to allow that, seeing that the built-in poly_ops opclass has a different value for opckeytype than opcintype, and that if you want to do lossy storage then the types must really be different. (Thus, user-defined opclasses doing lossy storage had to lie about what type is in the index.) Hence, remove the restriction, and make sure that we use the input column type not opckeytype where relevant.

For reasons of backwards compatibility with existing user-defined opclasses, we can't quite insist that the specified leafType match the STORAGE clause; instead just add an amvalidate() warning if they don't match.

Also fix some bugs that would only manifest when trying to return index entries when attType is different from attLeafType. It's not too surprising that these have not been reported, because the only usual reason for such a difference is to store the leaf value lossily, rendering index-only scans impossible.

Add a src/test/modules module to exercise cases where attType is different from attLeafType and yet index-only scan is supported.

Discussion: https://postgr.es/m/3728741.1617381471@sss.pgh.pa.us
* Fix bug in brin_minmax_multi_union (Tomas Vondra, 2021-04-04)

When calling sort_expanded_ranges() we need to remember the return value, because the function sorts and also deduplicates the ranges. So the number of ranges may decrease. brin_minmax_multi_union failed to do that, which resulted in crashes due to bogus ranges (equal minval/maxval but not marked as compacted).

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/20210404052550.GA4376%40ahch-to
* Fix order of parameters in BRIN minmax-multi calls (Tomas Vondra, 2021-04-04)

The BRIN minmax-multi consistent function incorrectly assumed it can look up an operator, and then swap the arguments to get the commutator. For example <(a,b) would be called as <(b,a) to get >(a,b). This works when the arguments are of the same type, but with cross-type opclasses this fails. We can't swap <(float4,float8) arguments, for example.

Fixed by passing arguments in the right order.

Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
* Fix BRIN minmax-multi distance for inet type (Tomas Vondra, 2021-04-04)

The distance calculation ignored the mask, unlike the inet comparator, which resulted in negative distance in some cases. Fixed by applying the mask in brin_minmax_multi_distance_inet. I've considered simply calling inetmi() to calculate the delta, but that does not consider the mask either.

Reviewed-by: Zhihong Yu
Discussion: https://postgr.es/m/1a0a7b9d-9bda-e3a2-7fa4-88f15042a051%40enterprisedb.com
* Fix BRIN minmax-multi distance for timetz type (Tomas Vondra, 2021-04-04)

The distance calculation ignored the time zone, so the result of (b-a) might have ended up negative even if (b > a). Fixed by considering the time zone difference.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jLZFLCxyxfT%3DMfK5mtPfSzHA1rVLowR-j4RRsFVvKm7A%40mail.gmail.com
* Fix BRIN minmax-multi distance for interval type (Tomas Vondra, 2021-04-04)

The distance calculation for the interval type was treating months as having 31 days, which is inconsistent with the interval comparator (which uses 30 days). Due to this it was possible to get a negative distance (b-a) when (a<b), triggering an assert. Fixed by adopting the same logic as interval_cmp_value.

Reported-by: Jaime Casanova
Discussion: https://postgr.es/m/CAJKUy5jKH0Xhneau2mNftNPtTy-BVgQfXc8zQkEvRvBHfeUThQ%40mail.gmail.com
* Rethink handling of pass-by-value leaf datums in SP-GiST. (Tom Lane, 2021-04-01)

The existing convention in SP-GiST is that any pass-by-value datatype is stored in Datum representation, i.e. it's of width sizeof(Datum) even when typlen is less than that. This is okay, or at least it's too late to change it, for prefix datums and node-label datums in inner (upper) tuples. But it's problematic for leaf datums, because we'd prefer those to be stored in Postgres' standard on-disk representation so that we can easily extend leaf tuples to carry additional "included" columns.

I believe, however, that we can get away with just up and changing that. This would be an unacceptable on-disk-format break, but there are two big mitigating factors:

1. It seems quite unlikely that there are any SP-GiST opclasses out there that use pass-by-value leaf datatypes. Certainly none of the ones in core do, nor has codesearch.debian.net heard of any. Given what SP-GiST is good for, it's hard to conceive of a use-case where the leaf-level values would be both small and fixed-width. (As an example, if you wanted to index text values with the leaf level being just a byte, then every text string would have to be represented with one level of inner tuple per preceding byte, which would be horrendously space-inefficient and slow to access. You always want to use as few inner-tuple levels as possible, leaving as much as possible in the leaf values.)

2. Even granting that you have such an index, this change only breaks things on big-endian machines. On little-endian, the high order bytes of the Datum format will now just appear to be alignment padding space.

So, change the code to store pass-by-value leaf datums in their usual on-disk form. Inner-tuple datums are not touched.

This is extracted from a larger patch that intends to add support for "included" columns. I'm committing it separately for visibility in our commit logs.

Pavel Borisov and Tom Lane, reviewed by Andrey Borodin

Discussion: https://postgr.es/m/CALT9ZEFi-vMp4faht9f9Junb1nO3NOSjhpxTmbm1UGLMsLqiEQ@mail.gmail.com
* Accept slightly-filled pages for tuples larger than fillfactor. (Noah Misch, 2021-03-30)

We always inserted a larger-than-fillfactor tuple into a newly-extended page, even when existing pages were empty or contained nothing but an unused line pointer. This was unnecessary relation extension. Start tolerating page usage up to 1/8 the maximum space that could be taken up by line pointers. This is somewhat arbitrary, but it should allow more cases to reuse pages. This has no effect on tables with fillfactor=100 (the default).

John Naylor and Floris van Nee. Reviewed by Matthias van de Meent. Reported by Floris van Nee.

Discussion: https://postgr.es/m/6e263217180649339720afe2176c50aa@opammb0562.comp.optiver.com
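For context, fillfactor is the setting in play here; a sketch (the value 50 is illustrative):

    CREATE TABLE t (a text) WITH (fillfactor = 50);
    -- A tuple wider than half a page previously always forced relation
    -- extension; now an existing page that is empty, or nearly so, may
    -- be reused instead.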
* Adjust design of per-worker parallel seqscan data struct (David Rowley, 2021-03-30)

The design of the data structures which allow storage of the per-worker memory during parallel seq scans was not ideal. The work done in 56788d215 required an additional data structure to allow workers to remember the range of pages that had been allocated to them for processing during a parallel seqscan. That commit added a void pointer field to TableScanDescData to allow heapam to store the per-worker allocation information. However, putting the field there made very little sense given that we have AM-specific structs for that, e.g. HeapScanDescData. Here we remove the void pointer field from TableScanDescData and add a dedicated field for this purpose to HeapScanDescData.

Previously we also allocated memory for this parallel per-worker data for all scans, regardless of whether it was a parallel scan or not. This was just a wasted allocation for non-parallel scans, so here we make the allocation conditional on the scan being parallel.

Also, add a previously missing pfree() to free the per-worker data in heap_endscan().

Reported-by: Andres Freund
Reviewed-by: Andres Freund
Discussion: https://postgr.es/m/20210317023101.anvejcfotwka6gaa@alap3.anarazel.de
* Fix alignment in BRIN minmax-multi deserialization (Tomas Vondra, 2021-03-26)

The deserialization failed to ensure correct alignment, as it assumed it could simply point into the serialized value. The serialization, however, ignores alignment and copies just the significant bytes in order to make the result as small as possible. This caused failures on systems that are sensitive to misaligned addresses, like sparc, or with address sanitizer enabled.

Fixed by copying the serialized data to ensure proper alignment. While at it, fix an issue with serialization on big endian machines, using the same store_att_byval/fetch_att trick as extended statistics.

Discussion: https://postgr.es/0c8c3304-d3dd-5e29-d5ac-b50589a23c8c%40enterprisedb.com
* BRIN minmax-multi indexes (Tomas Vondra, 2021-03-26)

Adds BRIN opclasses similar to the existing minmax, except that instead of summarizing the page range into a single [min,max] range, the summary consists of multiple ranges and/or points, allowing gaps. This allows more efficient handling of data with poor correlation to physical location within the table and/or outlier values, for which the regular minmax opclasses tend to work poorly.

It's possible to specify the number of values kept for each page range, either as a single point or an interval boundary.

    CREATE TABLE t (a int);
    CREATE INDEX ON t USING brin (a int4_minmax_multi_ops(values_per_range=16));

When building the summary, the values are combined into intervals with the goal to minimize the "covering" (sum of interval lengths), using a support procedure computing distance between two values.

Bump catversion, due to various catalog changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
* BRIN bloom indexes (Tomas Vondra, 2021-03-26)

Adds a BRIN opclass using a Bloom filter to summarize the range. Indexes using the new opclasses allow only equality queries (similar to hash indexes), but that works fine for data like UUID, MAC addresses etc. for which range queries are not very common. This also means the indexes work for data that is not well correlated to physical location within the table, or perhaps even entirely random (which is a common issue with existing BRIN minmax opclasses).

It's possible to specify opclass parameters with the usual Bloom filter parameters, i.e. the desired false-positive rate and the expected number of distinct values per page range.

    CREATE TABLE t (a int);
    CREATE INDEX ON t USING brin (a int4_bloom_ops(false_positive_rate = 0.05, n_distinct_per_range = 100));

The opclasses do not operate on the indexed values directly, but compute a 32-bit hash first, and the Bloom filter is built on the hash value. Collisions should not be a huge issue though, as the number of distinct values in a page range is usually fairly small.

Bump catversion, due to various catalog changes.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Sokolov Yura <y.sokolov@postgrespro.ru>
Reviewed-by: Nico Williams <nico@cryptonector.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
Discussion: https://postgr.es/m/5d78b774-7e9c-c94e-12cf-fef51cc89b1a%402ndquadrant.com
* Support the old signature of BRIN consistent function (Tomas Vondra, 2021-03-26)

Commit a1c649d889 changed the signature of the BRIN consistent function by adding a new required parameter. Treating the parameter as optional, which would make the change backwards compatible, was rejected with the justification that there are few out-of-core extensions, so it's not worth making the code more complex, and it's better to deal with that in the extension. But after further thought, that would be rather problematic, because pg_upgrade simply dumps catalog contents and the same version of an extension needs to work on both PostgreSQL versions. Supporting both variants of the consistent function (with 3 or 4 arguments) makes that possible.

The signature is not the only thing that changed, as commit 72ccf55cb9 moved handling of IS [NOT] NULL keys from the support procedures. But this change is backward compatible - handling the keys in the extension is unnecessary, but harmless. The consistent function will do a bit of unnecessary work, but it should be very cheap.

This also undoes most of the changes to the existing opclasses (minmax and inclusion), making them use the old signature again. This should make backpatching simpler.

Catversion bump, because of changes in pg_amproc.

Author: Tomas Vondra <tomas.vondra@postgresql.org>
Author: Nikita Glukhov <n.gluhov@postgrespro.ru>
Reviewed-by: Mark Dilger <hornschnorter@gmail.com>
Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
Reviewed-by: Masahiko Sawada <masahiko.sawada@enterprisedb.com>
Reviewed-by: John Naylor <john.naylor@enterprisedb.com>
Discussion: https://postgr.es/m/c1138ead-7668-f0e1-0638-c3be3237e812@2ndquadrant.com
* Fix interaction of TOAST compression with expression indexes. (Robert Haas, 2021-03-25)

Before, trying to compress a value for insertion into an expression index would crash.

Dilip Kumar, with some editing by me. Report by Jaime Casanova.

Discussion: http://postgr.es/m/CAJKUy5gcs0zGOp6JXU2mMVdthYhuQpFk=S3V8DOKT=LZC1L36Q@mail.gmail.com
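A sketch of the previously-crashing shape, under the assumption that a value large enough to need compression reaches the expression index (names are placeholders; pglz is always available):

    CREATE TABLE docs (body text COMPRESSION pglz);
    CREATE INDEX ON docs (lower(body));
    -- A large value compressed while forming the expression index
    -- tuple used to crash before this fix:
    INSERT INTO docs VALUES (repeat('abc', 100000));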
* Sanitize the term "combo CID" in code comments (Michael Paquier, 2021-03-25)

Combo CIDs were referred to in code comments using different terms across various places of the code, so unify the term used with what is currently in use in some of the READMEs.

Author: "Hou, Zhijie"
Discussion: https://postgr.es/m/1d42865c91404f46af4562532fdbea31@G08CNEXMBPEKD05.g08.fujitsu.local
* Fix bug in WAL replay of COMMIT_TS_SETTS record. (Fujii Masao, 2021-03-25)

Previously the WAL replay of the COMMIT_TS_SETTS record called TransactionTreeSetCommitTsData() with the argument write_xlog=true, which generated and wrote a new COMMIT_TS_SETTS record. This is not acceptable because it happens during recovery. This commit fixes the WAL replay of the COMMIT_TS_SETTS record so that it calls TransactionTreeSetCommitTsData() with write_xlog=false and doesn't generate new WAL during recovery.

Back-patch to all supported branches.

Reported-by: lx zou <zoulx1982@163.com>
Author: Fujii Masao
Reviewed-by: Alvaro Herrera
Discussion: https://postgr.es/m/16931-620d0f2fdc6108f1@postgresql.org
* Fix stray double semicolons (Peter Eisentraut, 2021-03-24)

Reported-by: John Naylor <john.naylor@enterprisedb.com>
* Tidy up more loose ends related to configurable TOAST compression. (Robert Haas, 2021-03-24)

Change the default_toast_compression GUC to be an enum rather than a string. Earlier, uncommitted versions of the patch supported using CREATE ACCESS METHOD to add new compression methods to a running system, but that idea was dropped before commit. So, we can simplify the GUC handling as well, which has the nice side effect of improving the error messages.

While updating the documentation to reflect the new GUC type, also move it back to the right place in the list. I moved this while revising what became commit 24f0e395ac5892cd12e8914646fe921fac5ba23d, but apparently the intended ordering is "alphabetical" rather than "whatever Robert thinks looks nice."

Rejigger things to avoid having access/toast_compression.h depend on utils/guc.h, so that we don't end up with every file that includes it also depending on something largely unrelated. Move a few inline functions back into the C source file partly to help reduce dependencies and partly just to avoid clutter.

A few very minor cosmetic fixes.

Original patch by Justin Pryzby, but very heavily edited by me, and reverse reviewed by him and also reviewed by Tom Lane.

Discussion: http://postgr.es/m/CA+TgmoYp=GT_ztUCeZg2i4hkHAQv8o=-nVJ1-TKWTG1zQOmOpg@mail.gmail.com
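A sketch of the GUC in action (as an enum, a bad value now yields a clean error listing the valid choices):

    SET default_toast_compression = 'pglz';
    CREATE TABLE t (a text);                  -- column follows the GUC default
    -- Per-column override (lz4 requires a build configured --with-lz4):
    CREATE TABLE u (b text COMPRESSION lz4);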
* Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."Amit Kapila2021-03-24
| | | | | | | | | | | | | | | | | | | To allow inserts in parallel-mode this feature has to ensure that all the constraints, triggers, etc. are parallel-safe for the partition hierarchy which is costly and we need to find a better way to do that. Additionally, we could have used existing cached information in some cases like indexes, domains, etc. to determine the parallel-safety. List of commits reverted, in reverse chronological order: ed62d3737c Doc: Update description for parallel insert reloption. c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode. c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f. e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f. e4e87a32cc Fix valgrind issue in commit 05c8482f7f. 05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...". Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
* Reword slightly logs generated for index stats in autovacuum (Michael Paquier, 2021-03-24)

Using "remain" is confusing, as it implies that the index file can shrink. Instead, use "in total".

Per discussion with Peter Geoghegan.

Discussion: https://postgr.es/m/CAH2-WzkYgHZzpGOwR14CScJsjaQpvJrEkEfkh_=wGhzLb=yVdQ@mail.gmail.com
* nbtree VACUUM: Cope with buggy opclasses. (Peter Geoghegan, 2021-03-23)

Teach nbtree VACUUM to press on with vacuuming in the event of a page deletion attempt that fails to "re-find" a downlink for its child/target page.

There is no good reason to treat this as an irrecoverable error. But there is a good reason not to: pressing on at this point removes any question of VACUUM not making progress solely due to misbehavior from user-defined operator class code.

Discussion: https://postgr.es/m/CAH2-Wzma5G9CTtMjbrXTwOym+U=aWg-R7=-htySuztgoJLvZXg@mail.gmail.com
* Avoid possible crash while finishing up a heap rewrite. (Tom Lane, 2021-03-23)

end_heap_rewrite was not careful to ensure that the target relation is open at the smgr level before performing its final smgrimmedsync. In ordinary cases this is no problem, because it would have been opened earlier during the rewrite. However a crash can be reproduced by re-clustering an empty table with CLOBBER_CACHE_ALWAYS enabled.

Although that exact scenario does not crash in v13, I think that's a chance result of unrelated planner changes, and the problem is likely still reachable with other test cases. The true proximate cause of this failure is commit c6b92041d, which replaced a call to heap_sync (which was careful about opening smgr) with a direct call to smgrimmedsync. Hence, back-patch to v13.

Amul Sul, per report from Neha Sharma; cosmetic changes and test case by me.

Discussion: https://postgr.es/m/CANiYTQsU7yMFpQYnv=BrcRVqK_3U3mtAzAsJCaqtzsDHfsUbdQ@mail.gmail.com
* Add per-index stats information in verbose logs of autovacuum (Michael Paquier, 2021-03-23)

Once a relation's autovacuum is completed, the logs include more information about the relation's state if the threshold of log_autovacuum_min_duration (or its relation option) is reached, for example statistics of the VACUUM operation for the relation, and WAL and system usage. This commit adds more information about the statistics of the relation's indexes, with one line of logs generated for each index. The index stats were already calculated, but not yet printed in the context of autovacuum.

While on it, some refactoring is done to keep track of the index statistics directly within LVRelStats, simplifying some routines related to parallel VACUUMs.

Author: Masahiko Sawada
Reviewed-by: Michael Paquier, Euler Taveira
Discussion: https://postgr.es/m/CAD21AoAy6SxHiTivh5yAPJSUE4S=QRPpSZUdafOSz0R+fRcM6Q@mail.gmail.com