postgresql - postgresql mirror

	Commit message (Collapse)	Author	Age
...
*	Don't try to fix eliminated nbtree array scan keys.	Peter Geoghegan	2024-04-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Preprocessing for nbtree index scans allowed array "input" scan keys already marked eliminated during array-specific preprocessing to be "fixed up" during preprocessing proper. This allowed eliminated scan keys on DESC index columns to spurious have their strategy commuted, causing assertion failures. To fix, teach _bt_fix_scankey_strategy to ignore these scan keys. This brings it in line with its only caller, _bt_preprocess_keys. Oversight in commit 5bf748b8, which enhanced nbtree ScalarArrayOp execution. Reported-By: Donghang Lin <donghanglin@gmail.com> Discussion: https://postgr.es/m/CAA=D8a2sHK6CAzZ=0CeafC-Y-MFXbYxnRSHvZTi=+JHu6kAa8Q@mail.gmail.com
*	Refactoring for CommitTransactionCommand()/AbortCurrentTransaction()	Alexander Korotkov	2024-04-18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fefd9a3fed turned tail recursion of CommitTransactionCommand() and AbortCurrentTransaction() into iteration. However, it splits the handling of cases between different functions. This commit puts the handling of all the cases into AbortCurrentTransactionInternal() and CommitTransactionCommandInternal(). Now CommitTransactionCommand() and AbortCurrentTransaction() are just doing the repeated calls of internal functions. Reported-by: Andres Freund Discussion: https://postgr.es/m/20240415224834.w6piwtefskoh32mv%40awork3.anarazel.de Author: Andres Freund
*	Cleanup parallel BRIN index build code	Tomas Vondra	2024-04-17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit b43757171470 added support for parallel builds of BRIN indexes, using code similar to BTREE. But there were to be a couple unnecessary differences, particularly in how the leader waits for the workers, and merges the results. So remove these, to make the code more similar. The leader never waited on the workersdonecv condition variable, but simply called WaitForParallelWorkersToFinish() in _brin_end_parallel() and then merged the per-worker results. This worked correctly, but it seems better to do the wait and merge before _brin_end_parallel(). This commit moves the relevant code to _brin_parallel_heapscan/merge(), which means _brin_end_parallel() remains responsible only for exiting the parallel mode and accumulating WAL usage data. Discussion: https://postgr.es/m/3733d042-71e1-6ae6-5fac-00c12db62db6@enterprisedb.com
*	Fix nbtree "deduce NOT NULL" scan key comment.	Peter Geoghegan	2024-04-16
\| \| \| \|	Oversight in commit c9c0589fda.
*	revert: Generalize relation analyze in table AM interface	Alexander Korotkov	2024-04-16
\| \| \| \| \| \|	This commit reverts 27bc1772fc and dd1f6b0c17. Per review by Andres Freund. Discussion: https://postgr.es/m/20240415201057.khoyxbwwxfgzomeo%40awork3.anarazel.de
*	Use the correct PG_DETOAST_DATUM macro in BRIN	Tomas Vondra	2024-04-14
\| \| \| \| \| \| \| \| \| \| \| \|	Commit 6bcda4a721 replaced PG_DETOAST_DATUM with PG_DETOAST_DATUM_PACKED in two BRIN output functions, for minmax-multi and bloom opclasses. But this is incorrect - the code is accessing the data through structs that already include a 4B header, so the detoast needs to match that. But the PACKED macro may keep the 1B header, which means the struct fields will point to incorrect data. Backpatch-through: 16 Discussion: https://postgr.es/m/1df00a66-db5a-4e66-809a-99b386a06d86%40enterprisedb.com
*	Update nbits_set in brin_bloom_union	Tomas Vondra	2024-04-14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Properly update the number of bits set in the bitmap after merging the filters in brin_bloom_union. This is mostly harmless, as the counter is used only in the output function, which means pageinspect may show incorrect information about the BRIN summary. The counter does not affect correctness. Discovered while adding a regression test comparing indexes built with and without parallelism. The parallel index builds exercise the union procedure when merging results from workers, which is otherwise very hard to do in a test. Which is why this went unnoticed until now. Backpatch through 14, where the BRIN bloom opclasses were introduced. Backpatch-through: 14 Discussion: https://postgr.es/m/1df00a66-db5a-4e66-809a-99b386a06d86%40enterprisedb.com
*	Revert: Implement pg_wal_replay_wait() stored procedure	Alexander Korotkov	2024-04-11
\| \| \| \| \| \| \|	This commit reverts 06c418e163, e37662f221, bf1e650806, 25f42429e2, ee79928441, and 74eaf66f98 per review by Heikki Linnakangas. Discussion: https://postgr.es/m/b155606b-e744-4218-bda5-29379779da1a%40iki.fi
*	Revert: Allow table AM to store complex data structures in rd_amcache	Alexander Korotkov	2024-04-11
\| \| \| \| \| \|	This commit reverts 02eb07ea89 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
*	Revert: Allow table AM tuple_insert() method to return the different slot	Alexander Korotkov	2024-04-11
\| \| \| \| \| \|	This commit reverts c35a3fb5e0 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
*	Revert: Allow locking updated tuples in tuple_update() and tuple_delete()	Alexander Korotkov	2024-04-11
\| \| \| \| \| \|	This commit reverts 87985cc925 and 818861eb57 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
*	Revert: Let table AM insertion methods control index insertion	Alexander Korotkov	2024-04-11
\| \| \| \| \| \|	This commit reverts b1484a3f19 per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
*	Revert: Custom reloptions for table AM	Alexander Korotkov	2024-04-11
\| \| \| \| \| \|	This commit reverts 9bd99f4c26 and 422041542f per review by Andres Freund. Discussion: https://postgr.es/m/20240410165236.rwyrny7ihi4ddxw4%40awork3.anarazel.de
*	Fix inconsistency with replay of hash squeeze record for clean buffers	Michael Paquier	2024-04-11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	aa5edbe379d6 has tweaked _hash_freeovflpage() so as the write buffer's LSN is updated only when necessary, when REGBUF_NO_CHANGE is not used. The replay code was not consistent with that, causing the write buffer's LSN to be updated and its page to be marked as dirty even if the buffer was registered in a "clean" state. This was possible for the case of a squeeze record when there are no tuples to add to the write buffer, for (is_prim_bucket_same_wrt && !is_prev_bucket_same_wrt). I have performed some validation of this commit with wal_consistency_checking and a change in WAL that logs REGBUF_NO_CHANGE to a new BKPIMAGE_*. Thanks to that, it is possible to know at replay if a buffer was clean when it was registered, then cross-checked the LSN of the "clean" page copy coming from WAL with the LSN of the block once the record has been replayed. This eats one bit in bimg_info, which is not acceptable to be integrated as-is, but it could become handy in the future. I didn't spot other areas than the one fixed by this commit at the extent of what the main regression test suite covers. As this is an oversight in aa5edbe379d6, no backpatch is required. Reported-by: Zubeyr Eryilmaz Author: Hayato Kuroda Reviewed-by: Amit Kapila, Michael Paquier Discussion: https://postgr.es/m/ZbyVVG_7eW3YD5-A@paquier.xyz
*	Get rid of anonymous struct	John Naylor	2024-04-09
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a C11 feature, and we require C99. While at it, go the further step and get rid of the surrounding union (with uintptr_t) entirely, as there is currently no use case for this file to access the header of BlocktableEntry as a uintptr_t, and there are no additional alignment requirements. The least invasive way seems to be to transfer the old union name to this struct. Reported by Pavel Borisov and Andres Freund, per buildfarm member mylodon Reviewed by Pavel Borisov Discussion: https://postgr.es/m/CALT9ZEH11NYV8AOzKb1bWhCf6J0H=H31f0MgT9xX+HdqvcA1rw@mail.gmail.com
*	Teach radix tree to embed values at runtime	John Naylor	2024-04-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, the decision to store values in leaves or within the child pointer was made at compile time, with variable length values using leaves by necessity. This commit allows introspecting the length of variable length values at runtime for that decision. This requires the ability to tell whether the last-level child pointer is actually a value, so we use a pointer tag in the lowest level bit. Use this in TID store. This entails adding a byte to the header to reserve space for the tag. Commit f35bd9bf3 stores up to three offsets within the header with no bitmap, and now the header can be embedded as above. This reduces worst-case memory usage when TIDs are sparse. Reviewed (in an earlier version) by Masahiko Sawada Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com
*	Teach TID store to skip bitmap for small numbers of offsets	John Naylor	2024-04-08
\| \| \| \| \| \| \| \| \| \| \| \|	The header portion of BlocktableEntry has enough padding space for an array of 3 offsets (1 on 32-bit platforms). Use this space instead of having a sparse bitmap array. This will take up a constant amount of space no matter what the offsets are. Reviewed (in an earlier version) by Masahiko Sawada Discussion: https://postgr.es/m/CANWCAZYw+_KAaUNruhJfE=h6WgtBKeDG32St8vBJBEY82bGVRQ@mail.gmail.com Discussion: https://postgr.es/m/CAD21AoBci3Hujzijubomo1tdwH3XtQ9F89cTNQ4bsQijOmqnEw@mail.gmail.com
*	Provide a way block-level table AMs could re-use acquire_sample_rows()	Alexander Korotkov	2024-04-08
\| \| \| \| \| \| \| \| \| \|	While keeping API the same, this commit provides a way for block-level table AMs to re-use existing acquire_sample_rows() by providing custom callbacks for getting the next block and the next tuple. Reported-by: Andres Freund Discussion: https://postgr.es/m/20240407214001.jgpg5q3yv33ve6y3%40awork3.anarazel.de Reviewed-by: Pavel Borisov
*	Fill CommonRdOptions with default values in extract_autovac_opts()	Alexander Korotkov	2024-04-08
\| \| \| \| \| \|	Reported-by: Thomas Munro Reported-by: Pavel Borisov Discussion: https://postgr.es/m/CA%2BhUKGLZzLR50RBvuqOO3MZ%3DF54ETz-rTp1PDX9uDGP_GqyYqA%40mail.gmail.com
*	Custom reloptions for table AM	Alexander Korotkov	2024-04-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Let table AM define custom reloptions for its tables. This allows specifying AM-specific parameters by the WITH clause when creating a table. The reloptions, which could be used outside of table AM, are now extracted into the CommonRdOptions data structure. These options could be by decision of table AM directly specified by a user or calculated in some way. The new test module test_tam_options evaluates the ability to set up custom reloptions and calculate fields of CommonRdOptions on their base. The code may use some parts from prior work by Hao Wu. Discussion: https://postgr.es/m/CAPpHfdurb9ycV8udYqM%3Do0sPS66PJ4RCBM1g-bBpvzUfogY0EA%40mail.gmail.com Discussion: https://postgr.es/m/AMUA1wBBBxfc3tKRLLdU64rb.1.1683276279979.Hmail.wuhao%40hashdata.cn Reviewed-by: Reviewed-by: Pavel Borisov, Matthias van de Meent, Jess Davis
*	Use bump context for TID bitmaps stored by vacuum	John Naylor	2024-04-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Vacuum does not pfree individual entries, and only frees the entire storage space when finished with it. This allows using a bump context, eliminating the chunk header in each leaf allocation. Most leaf allocations will be 16 to 32 bytes, so that's a significant savings. TidStoreCreateLocal gets a boolean parameter to indicate that the created store is insert-only. This requires a separate tree context for iteration, since we free the iteration state after iteration completes. Discussion: https://postgr.es/m/CANWCAZac%3DpBePg3rhX8nXkUuaLoiAJJLtmnCfZsPEAS4EtJ%3Dkg%40mail.gmail.com Discussion: https://postgr.es/m/CANWCAZZQFfxvzO8yZHFWtQV+Z2gAMv1ku16Vu7KWmb5kZQyd1w@mail.gmail.com
*	Remove references to old function name	Andres Freund	2024-04-07
\| \| \| \| \| \| \| \| \| \| \|	In a97bbe1f1df I accidentally referenced heapgetpage(), both in a function name and a comment. But since 44086b09753 the relevant function is named heap_prepare_pagescan(). Rename the new function to page_collect_tuples(). Reported-by: Melanie Plageman <melanieplageman@gmail.com> Reported-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20240407172615.cocrsvboqm3ttqe4@awork3.anarazel.de Discussion: https://postgr.es/m/CAApHDvp4SniHopTrVeKWcEvNXFtdki0utAvO=5R7H6TNhtULRQ@mail.gmail.com
*	Fix alignment of stack variable	John Naylor	2024-04-08
\| \| \| \| \| \| \| \|	Declare with union similar to PGAlignedBlock. Report and fix by Andres Freund Discussion: https://postgr.es/m/20240407190731.izm3mdazednrsiqk%40awork3.anarazel.de
*	Remove redundant nbtree preprocessing assertions.	Peter Geoghegan	2024-04-07
\| \| \| \| \| \| \| \|	One of the assertions was the subject of a false positive complaint from Coverity, but none of the assertions added much, so get rid of them. Reported-By: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/3000247.1712537309@sss.pgh.pa.us
*	Use streaming I/O in ANALYZE.	Thomas Munro	2024-04-08
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ANALYZE command prefetches and reads sample blocks chosen by a BlockSampler algorithm. Instead of calling [Prefetch\|Read]Buffer() for each block, ANALYZE now uses the streaming API introduced in b5a9b18cd0. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Jakub Wartak <jakub.wartak@enterprisedb.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/flat/CAN55FZ0UhXqk9v3y-zW_fp4-WCp43V8y0A72xPmLkOM%2B6M%2BmJg%40mail.gmail.com
*	Use conditional variable to wait for next MultiXact offset	Alvaro Herrera	2024-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In one multixact.c edge case, we need a mechanism to wait for one multixact offset to be written before being allowed to read the next one. We used to handle this case by sleeping for one millisecond and retrying, but such sleeps have been reported as problematic in production cases. We can avoid the problem by using a condition variable: readers sleep on it and then every creator of multixacts broadcasts into the CV when creation is sufficiently far along. Author: Kyotaro Horiguchi <horikyotajntt@gmail.com> Reviewed-by: Andrey Borodin <amborodin@acm.org> Discussion: https://postgr.es/m/47A598F4-B4E7-4029-8FEC-A06A6C3CB4B5@yandex-team.ru Discussion: https://postgr.es/m/20200515.090333.24867479329066911.horikyota.ntt
*	Avoid extra lookups with nbtree array inequalities.	Peter Geoghegan	2024-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	nbtree index scans with SAOP inequalities (but no SAOP equalities) performed extra ORDER proc lookups for any remaining equality strategy scan keys. This could waste cycles, and caused assertion failures. Keeping around a separate ORDER proc is only necessary for a scan's non-array/non-SAOP equality scan keys when the scan has at least one other SAOP equality strategy key (a SAOP inequality shouldn't count). To fix, replace _bt_preprocess_array_keys_final's assertion with a test that makes the function return early when the scan has no SAOP equality scan keys. Oversight in commit 1b134ca5, which enhanced nbtree ScalarArrayOp execution. Reported-By: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/0539d3d3-a402-0a49-ed5e-26429dffc4bd@gmail.com
*	Use streaming I/O in sequential scans.	Thomas Munro	2024-04-08
\| \| \| \| \| \| \| \| \| \|	Instead of calling ReadBuffer() for each block, heap sequential scans and TID range scans now use the streaming API introduced in b5a9b18cd0. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Discussion: https://postgr.es/m/flat/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg%3DgEQ%40mail.gmail.com
*	Add XLogCtl->logInsertResult	Alvaro Herrera	2024-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This tracks the position of WAL that's been fully copied into WAL buffers by all processes emitting WAL. (For some reason we call that "WAL insertion"). This is updated using atomic monotonic advance during WaitXLogInsertionsToFinish, which is not when the insertions actually occur, but it's the only place where we know where have all the insertions have completed. This value is useful in WALReadFromBuffers, which can verify that callers don't try to read past what has been inserted. (However, more infrastructure is needed in order to actually use WAL after the flush point, since it could be lost.) The value is also useful in WaitXLogInsertionsToFinish() itself, since we can now exit quickly when all WAL has been already inserted, without even having to take any locks.
*	Reduce branches in heapgetpage()'s per-tuple loop	Andres Freund	2024-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Until now, heapgetpage()'s loop over all tuples performed some conditional checks for each tuple, even though condition did not change across the loop. This commit fixes that by moving the loop into an inline function. By calling it with different constant arguments, the compiler can generate an optimized loop for the different conditions, at the price of two per-page checks. For cases of all-visible tables and an isolation level other than serializable, speedups of up to 25% have been measured. Reviewed-by: John Naylor <johncnaylorls@gmail.com> Reviewed-by: Zhang Mingli <zmlpostgres@gmail.com> Tested-by: Quan Zongliang <quanzongliang@yeah.net> Discussion: https://postgr.es/m/20230716015656.xjvemfbp5fysjiea@awork3.anarazel.de Discussion: https://postgr.es/m/2ef7ff1b-3d18-2283-61b1-bbd25fc6c7ce@yeah.net
*	Optimize visibilitymap_count() with AVX-512 instructions.	Nathan Bossart	2024-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 792752af4e added infrastructure for using AVX-512 intrinsic functions, and this commit uses that infrastructure to optimize visibilitymap_count(). Specificially, a new pg_popcount_masked() function is introduced that applies a bitmask to every byte in the buffer prior to calculating the population count, which is used to filter out the all-visible or all-frozen bits as needed. Platforms without AVX-512 support should also see a nice speedup due to the reduced number of calls to a function pointer. Co-authored-by: Ants Aasma Discussion: https://postgr.es/m/BL1PR11MB5304097DF7EA81D04C33F3D1DCA6A%40BL1PR11MB5304.namprd11.prod.outlook.com
*	BitmapHeapScan: Push skip_fetch optimization into table AM	Tomas Vondra	2024-04-07
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 7c70996ebf0949b142 introduced an optimization to allow bitmap scans to operate like index-only scans by not fetching a block from the heap if none of the underlying data is needed and the block is marked all visible in the visibility map. With the introduction of table AMs, a FIXME was added to this code indicating that the skip_fetch logic should be pushed into the table AM-specific code, as not all table AMs may use a visibility map in the same way. This commit resolves this FIXME for the current block. The layering violation is still present in BitmapHeapScans's prefetching code, which uses the visibility map to decide whether or not to prefetch a block. However, this can be addressed independently. Author: Melanie Plageman Reviewed-by: Andres Freund, Heikki Linnakangas, Tomas Vondra, Mark Dilger Discussion: https://postgr.es/m/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com
*	Call WaitLSNCleanup() in AbortTransaction()	Alexander Korotkov	2024-04-07
\| \| \| \| \| \| \| \| \| \| \|	Even though waiting for replay LSN happens without explicit transaction, AbortTransaction() is responsible for the cleanup of the shared memory if the error is thrown in a stored procedure. So, we need to do WaitLSNCleanup() there to clean up after some unexpected error happened while waiting for replay LSN. Discussion: https://postgr.es/m/202404051815.eri4u5q6oj26%40alvherre.pgsql Author: Alvaro Herrera
*	Enhance nbtree ScalarArrayOp execution.	Peter Geoghegan	2024-04-06
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 9e8da0f7 taught nbtree to handle ScalarArrayOpExpr quals natively. This works by pushing down the full context (the array keys) to the nbtree index AM, enabling it to execute multiple primitive index scans that the planner treats as one continuous index scan/index path. This earlier enhancement enabled nbtree ScalarArrayOp index-only scans. It also allowed scans with ScalarArrayOp quals to return ordered results (with some notable restrictions, described further down). Take this general approach a lot further: teach nbtree SAOP index scans to decide how to execute ScalarArrayOp scans (when and where to start the next primitive index scan) based on physical index characteristics. This can be far more efficient. All SAOP scans will now reliably avoid duplicative leaf page accesses (just like any other nbtree index scan). SAOP scans whose array keys are naturally clustered together now require far fewer index descents, since we'll reliably avoid starting a new primitive scan just to get to a later offset from the same leaf page. The scan's arrays now advance using binary searches for the array element that best matches the next tuple's attribute value. Required scan key arrays (i.e. arrays from scan keys that can terminate the scan) ratchet forward in lockstep with the index scan. Non-required arrays (i.e. arrays from scan keys that can only exclude non-matching tuples) "advance" without the process ever rolling over to a higher-order array. Naturally, only required SAOP scan keys trigger skipping over leaf pages (non-required arrays cannot safely end or start primitive index scans). Consequently, even index scans of a composite index with a high-order inequality scan key (which we'll mark required) and a low-order SAOP scan key (which we won't mark required) now avoid repeating leaf page accesses -- that benefit isn't limited to simpler equality-only cases. In general, all nbtree index scans now output tuples as if they were one continuous index scan -- even scans that mix a high-order inequality with lower-order SAOP equalities reliably output tuples in index order. This allows us to remove a couple of special cases that were applied when building index paths with SAOP clauses during planning. Bugfix commit 807a40c5 taught the planner to avoid generating unsafe path keys: path keys on a multicolumn index path, with a SAOP clause on any attribute beyond the first/most significant attribute. These cases are now all safe, so we go back to generating path keys without regard for the presence of SAOP clauses (just like with any other clause type). Affected queries can now exploit scan output order in all the usual ways (e.g., certain "ORDER BY ... LIMIT n" queries can now terminate early). Also undo changes from follow-up bugfix commit a4523c5a, which taught the planner to produce alternative index paths, with path keys, but without low-order SAOP index quals (filter quals were used instead). We'll no longer generate these alternative paths, since they can no longer offer any meaningful advantages over standard index qual paths. Affected queries thereby avoid all of the disadvantages that come from using filter quals within index scan nodes. They can avoid extra heap page accesses from using filter quals to exclude non-matching tuples (index quals will never have that problem). They can also skip over irrelevant sections of the index in more cases (though only when nbtree determines that starting another primitive scan actually makes sense). There is a theoretical risk that removing restrictions on SAOP index paths from the planner will break compatibility with amcanorder-based index AMs maintained as extensions. Such an index AM could have the same limitations around ordered SAOP scans as nbtree had up until now. Adding a pro forma incompatibility item about the issue to the Postgres 17 release notes seems like a good idea. Author: Peter Geoghegan <pg@bowt.ie> Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-By: Tomas Vondra <tomas.vondra@enterprisedb.com> Discussion: https://postgr.es/m/CAH2-Wz=ksvN_sjcnD1+Bt-WtifRA5ok48aDYnq3pkKhxgMQpcw@mail.gmail.com
*	Operate XLogCtl->log{Write,Flush}Result with atomics	Alvaro Herrera	2024-04-05
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This removes the need to hold both the info_lck spinlock and WALWriteLock to update them. We use stock atomic write instead, with WALWriteLock held. Readers can use atomic read, without any locking. This allows for some code to be reordered: some places were a bit contorted to avoid repeated spinlock acquisition, but that's no longer a concern, so we can turn them to more natural coding. Some further changes are possible (maybe to performance wins), but in this commit I did rather minimal ones only, to avoid increasing the blast radius. Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Jeff Davis <pgsql@j-davis.com> Reviewed-by: Andres Freund <andres@anarazel.de> (earlier versions) Discussion: https://postgr.es/m/20200831182156.GA3983@alvherre.pgsql
*	Secondary refactor of heap scanning functions	David Rowley	2024-04-04
\| \| \| \| \| \| \| \|	Similar to 44086b097, refactor heap scanning functions to be more suitable for the read stream API. Author: Melanie Plageman Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
*	Preliminary refactor of heap scanning functions	David Rowley	2024-04-04
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To allow the use of the read stream API added in b5a9b18cd for sequential scans on heap tables, here we make some adjustments to make that change less invasive and perhaps make the code easier to follow in the process. Here heapgetpage() gets broken into two functions: 1) The part which reads the block has now been moved into a function named heapfetchbuf(). 2) The part which performed pruning and populated the scan's rs_vistuples[] array is now moved into a new function named heap_prepare_pagescan(). The functionality provided by heap_prepare_pagescan() was only ever required by SO_ALLOW_PAGEMODE scans, so the branching that was previously done in heapgetpage() is no longer needed as we simply just don't call heap_prepare_pagescan() from heapgettup() in the refactored code. Author: Melanie Plageman Discussion: https://postgr.es/m/CAAKRu_YtXJiYKQvb5JsA2SkwrsizYLugs4sSOZh3EAjKUg=gEQ@mail.gmail.com
*	Invent SERIALIZE option for EXPLAIN.	Tom Lane	2024-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	EXPLAIN (ANALYZE, SERIALIZE) allows collection of statistics about the volume of data emitted by a query, as well as the time taken to convert the data to the on-the-wire format. Previously there was no way to investigate this without actually sending the data to the client, in which case network transmission costs might swamp what you wanted to see. In particular this feature allows investigating the costs of de-TOASTing compressed or out-of-line data during formatting. Stepan Rutz and Matthias van de Meent, reviewed by Tomas Vondra and myself Discussion: https://postgr.es/m/ca0adb0e-fa4e-c37e-1cd7-91170b18cae1@gmx.de
*	Split XLogCtl->LogwrtResult into separate struct members	Alvaro Herrera	2024-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	After this change we have XLogCtl->logWriteResult and ->logFlushResult. There's no functional change, other than the fact that the assignment from shared memory to local is no longer done via struct assignment, but instead using a macro that copies each member separately. The current representation is inconvenient going forward; notably, we would like to add a new member "Copy" (to keep track of the last position copied into WAL buffers), so the symmetry between the values in shared memory vs. those in local would be lost. This also gives us freedom to later change the concurrency model for the values in shared memory: we can make them use atomics instead of relying on the info_lck spinlock. Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/202404031119.cd2kugjk2vho@alvherre.pgsql
*	Combine freezing and pruning steps in VACUUM	Heikki Linnakangas	2024-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Execute both freezing and pruning of tuples in the same heap_page_prune() function, now called heap_page_prune_and_freeze(), and emit a single WAL record containing all changes. That reduces the overall amount of WAL generated. This moves the freezing logic from vacuumlazy.c to the heap_page_prune_and_freeze() function. The main difference in the coding is that in vacuumlazy.c, we looked at the tuples after the pruning had already happened, but in heap_page_prune_and_freeze() we operate on the tuples before pruning. The heap_prepare_freeze_tuple() function is now invoked after we have determined that a tuple is not going to be pruned away. VACUUM no longer needs to loop through the items on the page after pruning. heap_page_prune_and_freeze() does all the work. It now returns the list of dead offsets, including existing LP_DEAD items, to the caller. Similarly it's now responsible for tracking 'all_visible', 'all_frozen', and 'hastup' on the caller's behalf. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
*	Refactor how heap_prune_chain() updates prunable_xid	Heikki Linnakangas	2024-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In preparation of freezing and counting tuples which are not candidates for pruning, split heap_prune_record_unchanged() into multiple functions, depending the kind of line pointer. That's not too interesting right now, but makes the next commit smaller. Recording the lowest soon-to-be prunable xid is one of the actions we take for unchanged LP_NORMAL item pointers but not for others, so move that to the new heap_prune_record_unchanged_lp_normal() function. The next commit will add more actions to these functions. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
*	Use the pairing heap instead of a flat array for LSN replay waiters	Alexander Korotkov	2024-04-03
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	06c418e163 introduced pg_wal_replay_wait() procedure allowing to wait for the particular LSN to be replayed on standby. The waiters were stored in the flat array. Even though scanning small arrays is fast, that might be a problem at scale (a lot of waiting processes). This commit replaces the flat shared memory array with the pairing heap, which holds the waiter with the least LSN at the top. This gives us O(log N) complexity for both inserting and removing waiters. Reported-by: Alvaro Herrera Discussion: https://postgr.es/m/202404030658.hhj3vfxeyhft%40alvherre.pgsql
*	Add error codes to some PANIC/FATAL errors reports	Daniel Gustafsson	2024-04-03
\| \| \| \| \| \| \| \| \| \| \|	This adds errcodes to a set of PANIC and FATAL errors in xlog.c and relcache.c, which previously had no errcode at all set, in order to make fleetwide analysis of errorlogs easier. There are many more ereport/elogs left which could benefit from having an errcode but this at least makes a dent in the issue. Author: Nazir Bilal Yavuz <byavuz81@gmail.com> Discussion: https://postgr.es/m/CAN55FZ1k8LgLEqncPGmz_fWnrobV6bjABOTH4tOWta6xNcPQig@mail.gmail.com
*	Implement pg_wal_replay_wait() stored procedure	Alexander Korotkov	2024-04-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pg_wal_replay_wait() is to be used on standby and specifies waiting for the specific WAL location to be replayed before starting the transaction. This option is useful when the user makes some data changes on primary and needs a guarantee to see these changes on standby. The queue of waiters is stored in the shared memory array sorted by LSN. During replay of WAL waiters whose LSNs are already replayed are deleted from the shared memory array and woken up by setting of their latches. pg_wal_replay_wait() needs to wait without any snapshot held. Otherwise, the snapshot could prevent the replay of WAL records implying a kind of self-deadlock. This is why it is only possible to implement pg_wal_replay_wait() as a procedure working in a non-atomic context, not a function. Catversion is bumped. Discussion: https://postgr.es/m/eb12f9b03851bb2583adab5df9579b4b%40postgrespro.ru Author: Kartyshov Ivan, Alexander Korotkov Reviewed-by: Michael Paquier, Peter Eisentraut, Dilip Kumar, Amit Kapila Reviewed-by: Alexander Lakhin, Bharath Rupireddy, Euler Taveira
*	Revert "Custom reloptions for table AM"	Alexander Korotkov	2024-04-02
\| \| \| \| \| \| \| \|	This reverts commit c95c25f9af4bc77f2f66a587735c50da08c12b37 due to multiple design issues spotted after commit. Reported-by: Jeff Davis Discussion: https://postgr.es/m/11550b536211d5748bb2865ed6cb3502ff073bf7.camel%40j-davis.com
*	Use TidStore for dead tuple TIDs storage during lazy vacuum.	Masahiko Sawada	2024-04-02
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, we used a simple array for storing dead tuple IDs during lazy vacuum, which had a number of problems: * The array used a single allocation and so was limited to 1GB. * The allocation was pessimistically sized according to table size. * Lookup with binary search was slow because of poor CPU cache and branch prediction behavior. This commit replaces that array with the TID store from commit 30e144287a. Since the backing radix tree makes small allocations as needed, the 1GB limit is now gone. Further, the total memory used is now often smaller by an order of magnitude or more, depending on the distribution of blocks and offsets. These two features should make multiple rounds of heap scanning and index cleanup an extremely rare event. TID lookup during index cleanup is also several times faster, even more so when index order is correlated with heap tuple order. Since there is no longer a predictable relationship between the number of dead tuples vacuumed and the space taken up by their TIDs, the number of tuples no longer provides any meaningful insights for users, nor is the maximum number predictable. For that reason this commit also changes to byte-based progress reporting, with the relevant columns of pg_stat_progress_vacuum renamed accordingly to max_dead_tuple_bytes and dead_tuple_bytes. For parallel vacuum, both the TID store and supplemental information specific to vacuum are shared among the parallel vacuum workers. As with the previous array, we don't take any locks on TidStore during parallel vacuum since writes are still only done by the leader process. Bump catalog version. Reviewed-by: John Naylor, (in an earlier version) Dilip Kumar Discussion: https://postgr.es/m/CAD21AoAfOZvmfR0j8VmZorZjL7RhTiQdVttNuC4W-Shdc2a-AA%40mail.gmail.com
*	Introduce 'options' argument to heap_page_prune()	Heikki Linnakangas	2024-04-02
\| \| \| \| \| \| \| \| \| \| \| \|	Currently there is only one option, HEAP_PAGE_PRUNE_MARK_UNUSED_NOW which replaces the old boolean argument, but upcoming patches will introduce at least one more. Having a lot of boolean arguments makes it hard to see at the call sites what the arguments mean, so prefer a bitmask of options with human-readable names. Author: Melanie Plageman <melanieplageman@gmail.com> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://www.postgresql.org/message-id/20240401172219.fngjosaqdgqqvg4e@liskov
*	Handle non-chain tuples outside of heap_prune_chain()	Heikki Linnakangas	2024-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Handle dead branches of aborted HOT chains outside heap_prune_chain() as a separate phase. This simplifies the logic in heap_prune_chain(), as well as allowing us to clean up more RECENTLY_DEAD -> DEAD chains. To accomplish this efficiently, partition tuples into HOT and non-HOT while first collecting visibility information for each tuple in heap_page_prune(). Then call heap_prune_chain() only on potential chain members. Then mop up the leftover HOT tuples afterwards. As part of this, keep track of which items on page have already been processed, in 'processed' array. This replaces the 'marked' array which was only set for tuples marked for removal or redirection. The 'processed' array is updated also for items that are left unchanged, when we conclude that an item can be left unchanged. At the end of pruning, every item on the page should be marked as processed in the array; an assertion is added for that. Author: Melanie Plageman <melanieplageman@gmail.com> Author: Heikki Linnakangas <heikki.linnakangas@iki.fi> Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
*	Refactor heap_prune_chain()	Heikki Linnakangas	2024-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Keep track of the number of deleted tuples in PruneState and record this information when recording a tuple dead, unused or redirected. This removes a special case from the traversal and chain processing logic as well as setting a precedent of recording the impact of prune actions in the record functions themselves. This paradigm will be used in future commits which move tracking of additional statistics on pruning actions from lazy_scan_prune() to heap_prune_chain(). Simplify heap_prune_chain()'s chain traversal logic by handling each case explicitly. That is, do not attempt to share code when processing different types of chains. For each category of chain, process it specifically and procedurally: first handling the root, then any intervening tuples, and, finally, the end of the chain. While we are at it, add a few new comments to heap_prune_chain() clarifying some special cases involving RECENTLY_DEAD tuples. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://www.postgresql.org/message-id/20240330055710.kqg6ii2cdojsxgje@liskov
*	Minor refactoring in heap_page_prune	Heikki Linnakangas	2024-04-01
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pass 'page', 'blockno' and 'maxoff' to heap_prune_chain() as arguments, so that it doesn't need to fetch them from the buffer. This saves a few cycles per chain. Remove the "if (off_loc != NULL)" checks, and require the caller to pass a non-NULL 'off_loc'. Pass a pointer to a dummy local variable when it's not needed. Those checks are cheap, but it's still better to avoid them in the per-chain loops when we can do so easily. The CPU time saving from these changes are hardly measurable, but fewer instructions is good anyway, so why not. I spotted the potential for these while reviewing Melanie Plageman's patch set to combine prune and freeze records. Discussion: https://www.postgresql.org/message-id/CAAKRu_abm2tHhrc0QSQa%3D%3DsHe%3DVA1%3Doz1dJMQYUOKuHmu%2B9Xrg%40mail.gmail.com