...
* Remove GUC_NOT_IN_SAMPLE from enable_self_join_elimination (Alexander Korotkov, 2025-04-07)

  fc069a3a6319 implements Self-Join Elimination (SJE) and provides a new GUC
  variable: enable_self_join_elimination.  This new GUC variable was marked
  as GUC_NOT_IN_SAMPLE.  However, enable_self_join_elimination is documented
  and is no different from any other enable_* GUC.  Thus, remove
  GUC_NOT_IN_SAMPLE from it and add it to postgresql.conf.sample.

  Discussion: https://postgr.es/m/CAPpHfdsqMTEsmxk3aQwt6xPz%2BKpUELO%3D6fzmER9ZRGrbs4uMfA%40mail.gmail.com
  Author: Tender Wang <tndrwang@gmail.com>
  Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
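  As a hypothetical illustration of the feature this GUC controls (the table
  and column names are assumptions, not from the commit), a unique-key
  self-join can be removed by the planner when the setting is enabled:

      -- sketch only: assumes a table t with primary key id
      SET enable_self_join_elimination = on;
      EXPLAIN SELECT t1.* FROM t t1 JOIN t t2 ON t1.id = t2.id;
      -- the join against t2 can be eliminated, leaving a single scan of t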
* psql: Clarify help message for WATCH_INTERVAL (Daniel Gustafsson, 2025-04-07)

  The help message for WATCH_INTERVAL was hard to interpret and didn't
  follow the style of other messages; this updates it to make it fit in
  better and be easier to interpret.

  Author: Daniel Gustafsson <daniel@yesql.se>
  Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
  Reviewed-by: David G. Johnston <david.g.johnston@gmail.com>
  Discussion: https://postgr.es/m/20250326.120732.1167093737847500721.horikyota.ntt@gmail.com
* Fix grammar in log message of pg_restore.c (Michael Paquier, 2025-04-07)

  Introduced by 1495eff7bdb0.

  Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
  Discussion: https://postgr.es/m/20250407.151359.72428746612514925.horikyota.ntt@gmail.com
* libpq: Fix some issues in TAP tests for service files (Michael Paquier, 2025-04-07)

  The valid service file was not correctly shaped, as append_to_file() was
  called with an array as input.  This is changed so that the parameter and
  value pairs from the valid connection string are appended to the valid
  service file one by one.

  Even with the first issue fixed, the tests should have failed.  However,
  they have been passing because all the connection attempts relied on the
  default values given to PGPORT and PGHOST from the node when using
  Cluster.pm's connect_ok() and connect_fails(), rather than the data in the
  service file.  The test is updated to use an interesting trick: a dummy
  node is initialized but not started, and all the connection attempts are
  done through it.  This ensures that the data inside the service file is
  used for all the connection tests.  Note that breaking the contents of the
  valid service file on purpose makes all the tests that rely on it fail.

  Issues introduced by 72c2f36d5727.

  Author: Andrew Jackson <andrewjackson947@gmail.com>
  Discussion: https://postgr.es/m/CAKK5BkG_6_YSaebM6gG=8EuKaY7_VX1RFgYeySuwFPh8FZY73g@mail.gmail.com
* Clarify comment for worst-case allocation in quote_literal_cstr() (Michael Paquier, 2025-04-07)

  palloc() is invoked with a specific formula for its allocation size in
  quote_literal_cstr().  This wastes some memory, but the size is large
  enough to cover even the worst-case scenarios.  No explanations were given
  about the reasons behind these numbers.  This commit adds more
  documentation about all that.

  Author: Steve Chavez <steve@supabase.io>
  Discussion: https://postgr.es/m/CAGRrpzZ9bToRWS+fAnjxDJrxwZN1QcJ-y1Pn2yg=Hst6rydLtw@mail.gmail.com
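  A rough way to see why the worst case is about twice the input length plus
  a few bytes (every byte may need doubling, plus an escape-string prefix
  and two surrounding quotes) is the SQL-level equivalent; this reasoning is
  an editorial sketch, not taken from the commit:

      -- one input byte (a backslash) becomes E'\\':
      -- 1*2 payload bytes plus the E and the two quotes
      SELECT quote_literal(E'\\');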
* Fix use-after-free in pgstat_fetch_stat_backend_by_pid() (Michael Paquier, 2025-04-07)

  stats_fetch_consistency set to "snapshot" causes the backend entry
  "beentry" retrieved by pgstat_get_beentry_by_proc_number() to be reset at
  the beginning of pgstat_fetch_stat_backend() when fetching the backend
  pgstats entry.  As coded, "beentry" was being accessed after being freed.
  This commit moves all the accesses to "beentry" to happen before calling
  pgstat_fetch_stat_backend(), fixing the problem.

  This problem could be reached by calling the SQL functions
  pg_stat_get_backend_io() or pg_stat_get_backend_wal().

  Issue caught by valgrind.

  Reported-by: Alexander Lakhin <exclusion@gmail.com>
  Author: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
  Discussion: https://postgr.es/m/f1788cc0-253a-4a3a-aee0-1b8ab9538736@gmail.com
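  A sketch of the kind of query that could reach the buggy path (the
  subquery used to pick a PID is an illustrative assumption):

      SET stats_fetch_consistency = 'snapshot';
      -- fetch per-backend I/O stats for some live backend PID
      SELECT * FROM pg_stat_get_backend_io(
          (SELECT pid FROM pg_stat_activity
            WHERE backend_type = 'client backend' LIMIT 1));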
* Use XLOG_CONTROL_FILE macro consistently for control file name. (Fujii Masao, 2025-04-07)

  The XLOG_CONTROL_FILE macro (defined in access/xlog_internal.h) represents
  the control file name.  While some parts of the codebase already use this
  macro, others previously hardcoded the file name as a string.

  This commit replaces those hardcoded strings with the macro, ensuring
  consistent usage throughout the code.  This makes future maintenance
  easier and improves searchability, for example when grepping for control
  file usage.

  Author: Anton A. Melnikov <a.melnikov@postgrespro.ru>
  Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
  Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
  Reviewed-by: Masao Fujii <masao.fujii@gmail.com>
  Discussion: https://postgr.es/m/0841ec77-47e5-452a-adb4-c6fa55d605fc@postgrespro.ru
* Clean up checking for pg_dumpall output directory (Andrew Dunstan, 2025-04-06)

  Coverity objected to the original code, and in any case this is much
  cleaner, using the existing routine pg_check_dir() instead of rolling its
  own test.

  Per suggestion from Tom Lane.
* pg_upgrade: Fix memory leak in check_for_unicode_update(). (Nathan Bossart, 2025-04-06)

  This function was initializing the "task" variable before a couple of
  early returns.  To fix, postpone the initialization until just before it's
  needed.

  Per Coverity.

  Discussion: https://postgr.es/m/Z_KMsUH2-FEbiNjC%40nathan
* aio: Avoid spurious coverity warning (Andres Freund, 2025-04-06)

  PgAioResult.result is never accessed in the relevant path, but coverity
  complains about an uninitialized access anyway.  So just zero-initialize
  the whole thing.  While at it, reduce the scope of the variable.

  Reported-by: Ranier Vilela <ranier.vf@gmail.com>
  Reviewed-by: Noah Misch <noah@leadboat.com>
  Discussion: https://postgr.es/m/CAEudQApsKqd-s+fsUQ0OmxJAMHmBSXxrAz3dCs+uvqb3iRtjSw@mail.gmail.com
* Fix memory leaks in px_crypt_shacrypt(). (Tom Lane, 2025-04-06)

  Per Coverity.  I don't think these are of any actual significance since
  the function ought to be invoked in a short-lived context.  Still, if it's
  trying to be neat it should get it right.

  Also const-ify a constant and fix up typedef formatting.
* Use "(void)" to mark pgstat_lock_entry(..., false) calls. (Tom Lane, 2025-04-06)

  This should silence Coverity's complaints about the result being sometimes
  ignored.

  I'm inclined to think that these routines are simply misdesigned, because
  sometimes it's okay to ignore the result and sometimes it isn't, and we
  have no way to enforce the latter.  But for now I just added a comment.
* Avoid unnecessary copying of a string in pg_restore.c (Andrew Dunstan, 2025-04-06)

  Coverity complained about a possible overrun in the copy, but there is no
  actual need to copy the string at all.
* Fix a couple of memory leaks in pg_restore.c (Andrew Dunstan, 2025-04-06)

  Per complaint from Coverity.
* Relax ordering-related hardcoded btree requirements in planning (Peter Eisentraut, 2025-04-06)

  There were several places in ordering-related planning where a requirement
  for btree was hardcoded but an amcanorder index could suffice.  This fixes
  that.  We just need to do the necessary mapping between strategy numbers
  and compare types and adjust some related APIs so that this works
  independent of btree strategy numbers.  For instance, non-btree amcanorder
  indexes can now be used to support sorting and merge joins.  Also,
  predtest.c works independent of btree strategy numbers now.

  To avoid performance regressions, some details on btree and other built-in
  index types are still hardcoded as shortcuts, but other index types now
  have access to the same features by providing the required flags and
  callbacks.

  Author: Mark Dilger <mark.dilger@enterprisedb.com>
  Co-authored-by: Peter Eisentraut <peter@eisentraut.org>
  Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com
* Revert "Put enable_self_join_elimination into postgresql.conf.sample" (Alexander Korotkov, 2025-04-06)

  This reverts commit c2d329260cd8.

  Reported-by: Daniel Gustafsson <daniel@yesql.se>
  Discussion: https://postgr.es/m/D292EB44-806E-439A-82A4-491A1BA59E7A%40yesql.se
* Put enable_self_join_elimination into postgresql.conf.sample (Alexander Korotkov, 2025-04-06)

  fc069a3a6319 implements Self-Join Elimination (SJE) and provides a new GUC
  variable: enable_self_join_elimination.  This commit adds
  enable_self_join_elimination to the postgresql.conf.sample, as it was
  forgotten in the original commit.

  Discussion: https://postgr.es/m/CAHewXN%3D%2Bghd6O6im46q7j2u6c3H6vkXtXmF%3D_v4CfGSnjje8PA%40mail.gmail.com
  Author: Tender Wang <tndrwang@gmail.com>
* Compute CRC32C using AVX-512 instructions where available (John Naylor, 2025-04-06)

  The previous implementation of CRC32C on x86 relied on the native CRC32
  instruction from the SSE 4.2 extension, which operates on up to 8 bytes at
  a time.  We can get a substantial speedup by using carryless
  multiplication on SIMD registers, processing 64 bytes per loop iteration.
  Shorter inputs fall back to ordinary CRC instructions.  On Intel Tiger
  Lake hardware (2020), CRC is now 50% faster for inputs between 64 and 112
  bytes, and 3x faster for 256 bytes.

  The VPCLMULQDQ instruction on 512-bit registers has been available on
  Intel hardware since 2019 and AMD since 2022.  There is an older variant
  for 128-bit registers, but at least on Zen 2 it performs worse than normal
  CRC instructions for short inputs.

  We must now do a runtime check, even for builds that target SSE 4.2.  This
  doesn't matter in practice for WAL (arguably the most critical case),
  because since commit e2809e3a1 the final computation with the 20-byte WAL
  header is inlined and unrolled when targeting that extension.  Compared
  with two direct function calls, testing showed equal or slightly faster
  performance in performing an indirect function call on several dozen bytes
  followed by inlined instructions on constant input of 20 bytes.

  The MIT-licensed implementation was generated with the "generate" program
  from https://github.com/corsix/fast-crc32/

  Based on: "Fast CRC Computation for Generic Polynomials Using PCLMULQDQ
  Instruction" V. Gopal, E. Ozturk, et al., 2009

  Co-authored-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
  Co-authored-by: Paul Amonson <paul.d.amonson@intel.com>
  Reviewed-by: Nathan Bossart <nathandbossart@gmail.com>
  Reviewed-by: Andres Freund <andres@anarazel.de> (earlier version)
  Reviewed-by: Matthew Sterrett <matthewsterrett2@gmail.com> (earlier version)
  Tested-by: Raghuveer Devulapalli <raghuveer.devulapalli@intel.com>
  Tested-by: David Rowley <dgrowleyml@gmail.com> (earlier version)
  Discussion: https://postgr.es/m/BL1PR11MB530401FA7E9B1CA432CF9DC3DC192@BL1PR11MB5304.namprd11.prod.outlook.com
  Discussion: https://postgr.es/m/PH8PR11MB82869FF741DFA4E9A029FF13FBF72@PH8PR11MB8286.namprd11.prod.outlook.com
* Quote filename in error message (Daniel Gustafsson, 2025-04-05)

  Project standard is to quote filenames in error and log messages, which
  commit 2da74d8d640 missed in two error messages.

  Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
  Reported-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
  Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
  Discussion: https://postgr.es/m/20250404.120328.103562371975971823.horikyota.ntt@gmail.com
* Fix parse_cte.c's failure to examine sub-WITHs in DML statements. (Tom Lane, 2025-04-05)

  makeDependencyGraphWalker thought that only SelectStmt nodes could contain
  a WithClause.  Which was true in our original implementation of WITH, but
  astonishingly we missed updating this code when we added the ability to
  attach WITH to INSERT/UPDATE/DELETE (and later MERGE).  Moreover, since it
  was coded to deliberately block recursion to a WithClause, even updating
  raw_expression_tree_walker didn't save it.

  The upshot of this was that we didn't see references to outer CTE names
  appearing within an inner WITH, and would neither complain about
  disallowed recursion nor account for such references when sorting CTEs
  into a usable order.  The lack of complaints about this is perhaps not so
  surprising, because typical usage of WITH wouldn't hit either case.
  Still, it's pretty broken; failing to detect recursion here leads to
  assert failures or worse later on.

  Fix by factoring out the processing of sub-WITHs into a new function
  WalkInnerWith, and invoking that for all the statement types that can have
  WITH.

  Bug: #18878
  Reported-by: Yu Liang <luy70@psu.edu>
  Author: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/18878-a26fa5ab6be2f2cf@postgresql.org
  Backpatch-through: 13
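  A minimal sketch (table and CTE names are assumptions) of the shape of
  query that was previously mishandled: a DML statement nested in a WITH,
  carrying its own inner WITH that references an outer CTE name:

      WITH outer_cte AS (SELECT 1 AS n),
           dml AS (
             -- the INSERT's own WITH clause went unexamined, so the
             -- reference to outer_cte inside inner_cte was missed
             WITH inner_cte AS (SELECT n FROM outer_cte)
             INSERT INTO t SELECT n FROM inner_cte
             RETURNING n
           )
      SELECT * FROM dml;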
* Avoid double transformation of json_array()'s subquery. (Tom Lane, 2025-04-05)

  transformJsonArrayQueryConstructor() applied transformStmt() to the same
  subquery tree twice.  While this causes no issue in many cases, there are
  some where it causes a coredump, thanks to the parser's habit of
  scribbling on its input.  Fix by making a copy before the first
  transformation (compare 0f43083d1).

  This is quite brute-force, but then so is the whole business of
  transforming the input twice.  Per discussion in the bug thread, this
  implementation of json_array() parsing should be replaced completely.  But
  that will take some work and will surely not be back-patchable, so for the
  moment let's take the easy way out.

  Oversight in 7081ac46a.  Back-patch to v16 where that came in.

  Bug: #18877
  Reported-by: Yu Liang <luy70@psu.edu>
  Author: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/18877-c3c3ad75845833bb@postgresql.org
  Backpatch-through: 16
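  For reference, the affected construct is the subquery form of
  JSON_ARRAY(); a minimal hypothetical example of a query exercising this
  parsing path:

      -- the subquery tree here was previously transformed twice
      SELECT JSON_ARRAY(SELECT generate_series(1, 3));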
* Clean up from commit 1495eff7bdb (Andrew Dunstan, 2025-04-05)

  Fix some comments, and remove the hacky way of quoting database names in
  favor of appendStringLiteralConn.
* Set log_statement=none in t/002_pg_upgrade.pl (Álvaro Herrera, 2025-04-05)

  This should make the test a wee bit faster on high-load machines (e.g.,
  when running under valgrind).

  Per complaint from Andres Freund.

  Discussion: https://postgr.es/m/cwbcyjp2ts7o7xgy5y5gwtcd4zltvncsj67el7xgci7xbwrhlu@k363vk5tce4g
* pg_dump: Tiny header cleanup (Álvaro Herrera, 2025-04-05)

  In commits 9c02e3a986da and 8ec0aaeae094, Nathan added a duplicate
  TocEntry typedef forward declaration (plus assorted #ifdef hackery to
  avoid C99 preprocessor issues) to deal with some very old untidiness
  regarding the DefnDumperPtr function prototype being located in
  pg_backup.h.  But there's no reason to have the DefnDumperPtr typedef (and
  the accompanying DataDumperPtr typedef) in that file at all; they are
  better placed in pg_backup_archiver.h, the internal header, because they
  are only used internally.  That also requires zero #ifdef hackery, so move
  them there.

  Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/202504042140.qo66ggw6wzsz@alvherre.pgsql
* pg_dump: Fix query for gathering attribute stats on older versions. (Nathan Bossart, 2025-04-04)

  Commit 9c02e3a986 taught pg_dump to retrieve attribute statistics for 64
  relations at a time.  pg_dump supports dumping from v9.2 and newer
  versions, but our query for retrieving statistics for multiple relations
  uses WITH ORDINALITY and multi-argument UNNEST(), both of which were
  introduced in v9.4.  To fix, we resort to gathering statistics for a
  single relation at a time on versions older than v9.4.

  Per buildfarm member crake.

  Author: Corey Huinker <corey.huinker@gmail.com>
  Discussion: https://postgr.es/m/Z_BcWVMvlUIJ_iuZ%40nathan
* Repair misbehavior with duplicate entries in FK SET column lists. (Tom Lane, 2025-04-04)

  Since v15 we've had an option to apply a foreign key constraint's ON
  DELETE SET DEFAULT or SET NULL action to just some of the referencing
  columns.  There was not a check for duplicate entries in the list of
  columns-to-set, though.  That caused a potential memory stomp in
  CreateConstraintEntry(), which incautiously assumed that the list of
  columns-to-set couldn't be longer than the number of key columns.  Even
  after fixing that, the case doesn't work because you get an error like
  "multiple assignments to same column" from the SQL command that is
  generated to do the update.

  We could either raise an error for duplicate columns or silently suppress
  the dups, and after a bit of thought I chose to do the latter.  This is
  motivated by the fact that duplicates in the FK column list are legal, so
  it's not really clear why duplicates in the columns-to-set list shouldn't
  be.  Of course there's no need to actually set the column more than once.

  I left in the fix in CreateConstraintEntry() too, just because it didn't
  seem like such low-level code ought to be making assumptions about what
  it's handed.

  Bug: #18879
  Reported-by: Yu Liang <luy70@psu.edu>
  Author: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/18879-259fc59d072bd4d7@postgresql.org
  Backpatch-through: 15
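  A minimal sketch (hypothetical table names) of the previously-problematic
  construct; the duplicate entry in the columns-to-set list is now silently
  suppressed:

      CREATE TABLE pk (a int PRIMARY KEY);
      CREATE TABLE fk (a int,
          FOREIGN KEY (a) REFERENCES pk
              ON DELETE SET NULL (a, a));  -- duplicate "a" now de-duplicated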
* functions.c: copy trees from source_list before parse analysis etc. (Tom Lane, 2025-04-04)

  This is yet another bit of fallout from the fact that backend/parser (like
  other code) feels free to scribble on the parse tree it's handed.  In this
  case that resulted in modifying the relatively-short-lived copy in the
  cached function's source_list.  That would be fine since we only need each
  source_list tree once ... except that if the parser fails after making
  some changes, the function cache entry remains as-is and will still be
  there if the user tries to execute the function again.  Then we have
  problems because we're feeding a non-pristine tree to the parser.

  The most expedient fix is a quick copyObject().  I considered other
  answers like somehow marking the cache entry invalid temporarily, but that
  would add complexity and I'm not sure it's worth it.  In typical scenarios
  we'd only do this once per function query per session.

  Reported-by: Alexander Lakhin <exclusion@gmail.com>
  Author: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/6d442183-102c-498a-81d1-eeeb086cdc5a@gmail.com
* Fix a couple of error messages and tests for them (Andrew Dunstan, 2025-04-04)

  Oversights in 1495eff7bdb and 289f74d0cb2.  Mea culpa.
* Prevent redeclaration of typedef TocEntry. (Nathan Bossart, 2025-04-04)

  Commit 9c02e3a986 added a forward declaration for this typedef that caused
  redeclarations, which is not valid in C99.  To fix, add some preprocessor
  guards to avoid a redefinition, as is done elsewhere (e.g., commit
  382092a0cd).

  Per buildfarm.
* Add more TAP tests for pg_dumpall (Andrew Dunstan, 2025-04-04)

  Author: Matheus Alcantara <matheusssilv97@gmail.com>
  Author: Mahendra Singh Thalor <mahi6run@gmail.com>
* Non-text modes for pg_dumpall, correspondingly change pg_restore (Andrew Dunstan, 2025-04-04)

  pg_dumpall acquires a new -F/--format option, with the same meanings as
  pg_dump.  The default is p, meaning plain text.  For any other value, a
  directory is created containing two files, globals.dat and map.dat.  The
  first contains SQL for restoring the global data, and the second contains
  a map from oids to database names.  It will also contain a subdirectory
  called databases, inside which it will create archives in the specified
  format, named using the database oids.

  In these cases the -f argument is required.

  If pg_restore encounters a directory containing globals.dat, and no
  toc.dat, it restores the global settings and then restores each database.

  pg_restore acquires two new options: -g/--globals-only, which suppresses
  restoration of any databases, and --exclude-database, which inhibits
  restoration of particular database(s) in the same way the same option
  works in pg_dumpall.

  Author: Mahendra Singh Thalor <mahi6run@gmail.com>
  Co-authored-by: Andrew Dunstan <andrew@dunslane.net>
  Reviewed-by: jian he <jian.universality@gmail.com>
  Reviewed-by: Srinath Reddy <srinath2133@gmail.com>
  Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
  Discussion: https://postgr.es/m/cb103623-8ee6-4ba5-a2c9-f32e3a4933fa@dunslane.net
* add new list type simple_oid_string_list to fe-utils/simple_list (Andrew Dunstan, 2025-04-04)

  This type contains both an oid and a string.

  This will be used in forthcoming changes to pg_restore.

  Author: Andrew Dunstan <andrew@dunslane.net>
* Move common pg_dump code related to connections to a new file (Andrew Dunstan, 2025-04-04)

  ConnectDatabase is used by pg_dumpall, pg_restore and pg_dump, so move the
  common code to a new file, connectdb.c.

  Author: Mahendra Singh Thalor <mahi6run@gmail.com>
* Remove unused function parameters in pg_backup_archiver.c. (Nathan Bossart, 2025-04-04)

  Thanks to commit 9c02e3a986, which modified some of the changes from
  commit a0a4601765, we can remove the now-unused ArchiveHandle parameter
  from _tocEntryRestorePass() and move_to_ready_heap().

  Reviewed-by: Jeff Davis <pgsql@j-davis.com>
  Discussion: https://postgr.es/m/Z-3x2AnPCP331JA3%40nathan
* pg_dump: Retrieve attribute statistics in batches. (Nathan Bossart, 2025-04-04)

  Currently, pg_dump gathers attribute statistics with a query per relation,
  which can cause pg_dump to take significantly longer, especially when
  there are many relations.  This commit addresses this by teaching pg_dump
  to gather attribute statistics for 64 relations at a time.  Some simple
  tests showed this was the optimal batch size, but performance may vary
  depending on the workload.

  Our lookahead code determines the next batch of relations by searching the
  TOC sequentially for relevant entries.  This approach assumes that we will
  dump all such entries in TOC order, which unfortunately isn't true for
  dump formats that use RestoreArchive().  RestoreArchive() does multiple
  passes through the TOC and selectively dumps certain groups of entries
  each time.  This is particularly problematic for index stats and a subset
  of matview stats; both are in SECTION_POST_DATA, but matview stats that
  depend on matview data are dumped in RESTORE_PASS_POST_ACL, while all
  other stats are dumped in RESTORE_PASS_MAIN.  To handle this, this commit
  moves all statistics data entries in SECTION_POST_DATA to
  RESTORE_PASS_POST_ACL, which ensures that we always dump them in TOC
  order.  A convenient side effect of this change is that we can revert a
  decent chunk of commit a0a4601765, but that is left for a follow-up
  commit.

  Author: Corey Huinker <corey.huinker@gmail.com>
  Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
  Reviewed-by: Jeff Davis <pgsql@j-davis.com>
  Discussion: https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com
* pg_dump: Reduce memory usage of dumps with statistics. (Nathan Bossart, 2025-04-04)

  Right now, pg_dump stores all generated commands for statistics in memory.
  These commands can be quite large and therefore can significantly increase
  pg_dump's memory footprint.  To fix, wait until we are about to write out
  the commands before generating them, and be sure to free the commands
  after writing.  This is implemented via a new defnDumper callback that
  works much like the dataDumper one but is specifically designed for TOC
  entries.

  Custom dumps that include data might write the TOC twice (to update data
  offset information), which would ordinarily cause pg_dump to run the
  attribute statistics queries twice.  However, as a hack, we save the
  length of the written-out entry in the first pass and skip over it in the
  second.  While there is no known technical issue with executing the
  queries multiple times and rewriting the results, it's expensive and feels
  risky, so let's avoid it.

  As an exception, we _do_ execute the queries twice for the tar format.
  This format does a second pass through the TOC to generate the restore.sql
  file.  pg_restore doesn't use this file, so even if the second round of
  queries returns different results than the first, it won't corrupt the
  output; the archive and restore.sql file will just have different content.

  A follow-up commit will teach pg_dump to gather attribute statistics in
  batches, which our testing indicates more than makes up for the added
  expense of running the queries twice.

  Author: Corey Huinker <corey.huinker@gmail.com>
  Co-authored-by: Nathan Bossart <nathandbossart@gmail.com>
  Reviewed-by: Jeff Davis <pgsql@j-davis.com>
  Discussion: https://postgr.es/m/CADkLM%3Dc%2Br05srPy9w%2B-%2BnbmLEo15dKXYQ03Q_xyK%2BriJerigLQ%40mail.gmail.com
* Skip second WriteToc() call for custom-format dumps without data. (Nathan Bossart, 2025-04-04)

  Presently, "pg_dump --format=custom" calls WriteToc() twice.  The second
  call updates the data offset information, which allegedly makes parallel
  pg_restore significantly faster.  However, if we're not dumping any data,
  there are no data offsets to update, so we can skip this step.

  Reviewed-by: Jeff Davis <pgsql@j-davis.com>
  Discussion: https://postgr.es/m/Z9c1rbzZegYQTOQE%40nathan
* Use streaming read I/O in autoprewarm (Melanie Plageman, 2025-04-04)

  Make a read stream for each valid fork of each valid relation represented
  in the autoprewarm dump file and prewarm those blocks through the read
  stream API instead of by directly invoking ReadBuffer().

  Co-authored-by: Nazir Bilal Yavuz <byavuz81@gmail.com>
  Co-authored-by: Melanie Plageman <melanieplageman@gmail.com>
  Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
  Reviewed-by: Daniel Gustafsson <daniel@yesql.se>
  Reviewed-by: Andrey M. Borodin <x4mmm@yandex-team.ru> (earlier versions)
  Reviewed-by: Kirill Reshke <reshkekirill@gmail.com> (earlier versions)
  Reviewed-by: Matheus Alcantara <mths.dev@pm.me> (earlier versions)
  Discussion: https://postgr.es/m/flat/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com
* Avoid extra index searches through preprocessing. (Peter Geoghegan, 2025-04-04)

  Transform low_compare and high_compare nbtree skip array inequalities
  (with opclasses that offer skip support) in such a way as to allow
  _bt_first to consistently apply later keys when it descends the tree.
  This can lower the number of index searches for multi-column scans that
  use a ">" key on one of the index's prefix columns (or use a "<" key, when
  scanning backwards) when it precedes some later lower-order key.

  For example, an index qual "WHERE a > 5 AND b = 2" will now be converted
  to "WHERE a >= 6 AND b = 2" by a new preprocessing step that takes place
  after low_compare and high_compare have been finalized.  That way, the
  initial call to _bt_first can use "WHERE a >= 6 AND b = 2" to find an
  initial position, rather than just using "WHERE a > 5" -- "b = 2" can be
  applied during every _bt_first call.  There's a decent chance that this
  will allow such a scan to avoid the extra search that might otherwise be
  needed to determine the lowest "a" value still satisfying "WHERE a > 5".

  The transformation process can only lower the total number of index pages
  read when the use of a more restrictive set of initial positioning keys in
  _bt_first actually allows the scan to land on some later leaf page
  directly, relative to the unoptimized case (or on an earlier leaf page
  directly, when scanning backwards).  But the savings can really add up in
  cases where an affected skip array comes after some other array.  For
  example, a scan indexqual "WHERE x IN (1, 2, 3) AND y > 5 AND z = 2" can
  save as many as 3 _bt_first calls by applying the new transformation to
  its "y" array (up to 1 extra search can be avoided per "x" element).

  Follow-up to commit 92fe23d9, which added nbtree skip scan.

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Discussion: https://postgr.es/m/CAH2-Wz=FJ78K3WsF3iWNxWnUCY9f=Jdg3QPxaXE=uYUbmuRz5Q@mail.gmail.com
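  A sketch of the qual shape described above (the table and index names are
  assumptions; the qual itself is quoted from the commit):

      -- with skip support on "a", preprocessing can treat "a > 5" as
      -- "a >= 6", letting the initial descent apply "b = 2" as well
      CREATE INDEX t_a_b_idx ON t (a, b);
      SELECT * FROM t WHERE a > 5 AND b = 2;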
* Improve nbtree skip scan primitive scan scheduling. (Peter Geoghegan, 2025-04-04)

  Don't allow nbtree scans with skip arrays to end any primitive scan on its
  first leaf page without giving some consideration to how many times the
  scan's arrays advanced while changing at least one skip array (though
  continue not caring about the number of array advancements that only
  affected SAOP arrays, even during skip scans with SAOP arrays).  Now when
  a scan performs more than 3 such array advancements in the course of
  reading a single leaf page, it is taken as a signal that the next page is
  unlikely to be skippable.  We'll therefore continue the ongoing primitive
  index scan, at least until we can perform a recheck against the next
  page's finaltup.

  Testing has shown that this new heuristic occasionally makes all the
  difference with skip scans that were expected to rely on the "passed first
  page" heuristic added by commit 9a2e2a28.  Without it, there is a
  remaining risk that certain kinds of skip scans will never quite manage to
  clear the initial hurdle of performing a primitive scan that lasts beyond
  its first leaf page (or that such a skip scan will only clear that initial
  hurdle when it has already wasted noticeably-many cycles due to
  inefficient primitive scan scheduling).

  Follow-up to commits 92fe23d9 and 9a2e2a28.

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Discussion: https://postgr.es/m/CAH2-Wz=RVdG3zWytFWBsyW7fWH7zveFvTHed5JKEsuTT0RCO_A@mail.gmail.com
* pg_recvlogical: Add --failover option. (Masahiko Sawada, 2025-04-04)

  This new option instructs pg_recvlogical to create the logical replication
  slot with the failover option enabled.  It can be used in conjunction with
  the --create-slot option.

  Author: Hayato Kuroda <kuroda.hayato@fujitsu.com>
  Reviewed-by: Michael Banck <mbanck@gmx.net>
  Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
  Discussion: https://postgr.es/m/OSCPR01MB14966C54097FC83AF19F3516BF5AC2@OSCPR01MB14966.jpnprd01.prod.outlook.com
* Oversight in commit b81ffa13e3. (Jeff Davis, 2025-04-04)

  Should warn if a materialized view may be affected, as well.
* Further optimize nbtree search scan key comparisons. (Peter Geoghegan, 2025-04-04)

  Postgres 17 commit e0b1ee17 added two complementary optimizations to
  nbtree: the "prechecked" and "firstmatch" optimizations.  _bt_readpage was
  made to avoid needlessly evaluating keys that are guaranteed to be
  satisfied by applying page-level context.  "prechecked" did this for keys
  required in the current scan direction, while "firstmatch" did it for keys
  required in the opposite-to-scan direction only.

  The "prechecked" design had a number of notable issues.  It didn't account
  for the fact that an = array scan key's sk_argument field might need to
  advance at the point of the page precheck (it didn't check the precheck
  tuple against the key's array, only the key's sk_argument, which
  needlessly made it ineffective in cases involving stepping to a page
  having advanced the scan's arrays using a truncated high key).
  "prechecked" was also completely ineffective when only one scan key wasn't
  guaranteed to be satisfied by every tuple (it didn't recognize that it was
  still safe to avoid evaluating other, earlier keys).

  The "firstmatch" optimization had similar limitations.  It could only be
  applied after _bt_readpage found its first matching tuple, regardless of
  why any earlier tuples failed to satisfy the scan's index quals.  This
  allowed unsatisfied non-required scan keys to impede the optimization.

  Replace both optimizations with a new optimization, without any of these
  limitations: the "startikey" optimization.  Affected _bt_readpage calls
  generate a page-level key offset ("startikey"), that their _bt_checkkeys
  calls can then start at.  This is an offset to the first key that isn't
  known to be satisfied by every tuple on the page.

  Although this is independently useful work, its main goal is to avoid
  performance regressions with index scans that use skip arrays, but still
  never manage to skip over irrelevant leaf pages.  We must avoid wasting
  CPU cycles on overly granular skip array maintenance in these cases.  The
  new "startikey" optimization helps with this by selectively disabling
  array maintenance for the duration of a _bt_readpage call.  This has no
  lasting consequences for the scan's array keys (they'll still reliably
  track the scan's progress through the index's key space whenever the scan
  is "between pages").

  Skip scan adds skip arrays during preprocessing using simple, static
  rules, and decides how best to navigate/apply the scan's skip arrays
  dynamically, at runtime.  The "startikey" optimization enables this
  approach.  As a result of all this, the planner doesn't need to generate
  distinct, competing index paths (one path for skip scan, another for an
  equivalent traditional full index scan).  The overall effect is to make
  scan runtime close to optimal, even when the planner works off an
  incorrect cardinality estimate.  Scans will also perform well given a
  skipped column with data skew: individual groups of pages with many
  distinct values (in respect of a skipped column) can be read about as
  efficiently as before -- without the scan being forced to give up on
  skipping over other groups of pages that are provably irrelevant.

  Many scans that cannot possibly skip will still benefit from the use of
  skip arrays, since they'll allow the "startikey" optimization to be as
  effective as possible (by allowing preprocessing to mark all the scan's
  keys as required).  A scan that uses a skip array on "a" for a qual
  "WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now,
  even when every tuple read by the scan has its own distinct "a" value.
  However, there are still some remaining regressions, affecting certain
  trickier cases.

  Scans whose index quals have several range skip arrays, each on some high
  cardinality column, can still be slower than they were before the
  introduction of skip scan -- even with the new "startikey" optimization.
  There are also known regressions affecting very selective index scans that
  use a skip array.  The underlying issue with such selective scans is that
  they never get as far as reading a second leaf page, and so will never get
  a chance to consider applying the "startikey" optimization.  In principle,
  all regressions could be avoided by teaching preprocessing to not add skip
  arrays whenever they aren't expected to help, but it seems best to err on
  the side of robust performance.

  Follow-up to commit 92fe23d9, which added nbtree skip scan.

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
  Reviewed-By: Masahiro Ikeda <ikedamsh@oss.nttdata.com>
  Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
  Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com
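  A sketch of the qual shape quoted above, where a range skip array on "a"
  plus the "startikey" optimization matter (the table and index names are
  assumptions):

      -- skip array on "a" with range bounds; "b = 42" is marked required
      CREATE INDEX t_a_b_idx ON t (a, b);
      SELECT * FROM t WHERE a BETWEEN 0 AND 1000000 AND b = 42;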
* Add nbtree skip scan optimization. (Peter Geoghegan, 2025-04-04)

  Teach nbtree multi-column index scans to opportunistically skip over
  irrelevant sections of the index given a query with no "=" conditions on
  one or more prefix index columns.  When nbtree is passed input scan keys
  derived from a predicate "WHERE b = 5", new nbtree preprocessing steps
  output "WHERE a = ANY(<every possible 'a' value>) AND b = 5" scan keys.
  That is, preprocessing generates a "skip array" (and an output scan key)
  for the omitted prefix column "a", which makes it safe to mark the scan
  key on "b" as required to continue the scan.  The scan is therefore able
  to repeatedly reposition itself by applying both the "a" and "b" keys.

  A skip array has "elements" that are generated procedurally and on demand,
  but otherwise works just like a regular ScalarArrayOp array.
  Preprocessing can freely add a skip array before or after any input
  ScalarArrayOp arrays.  Index scans with a skip array decide when and where
  to reposition the scan using the same approach as any other scan with
  array keys.  This design builds on the design for array advancement and
  primitive scan scheduling added to Postgres 17 by commit 5bf748b8.

  Testing has shown that skip scans of an index with a low cardinality
  skipped prefix column can be multiple orders of magnitude faster than an
  equivalent full index scan (or sequential scan).  In general, the
  cardinality of the scan's skipped column(s) limits the number of leaf
  pages that can be skipped over.

  The core B-Tree operator classes on most discrete types generate their
  array elements with the help of their own custom skip support routine.
  This infrastructure gives nbtree a way to generate the next required array
  element by incrementing (or decrementing) the current array value.  It can
  reduce the number of index descents in cases where the next possible
  indexable value frequently turns out to be the next value stored in the
  index.  Opclasses that lack a skip support routine fall back on having
  nbtree "increment" (or "decrement") a skip array's current element by
  setting the NEXT (or PRIOR) scan key flag, without directly changing the
  scan key's sk_argument.  These sentinel values behave just like any other
  value from an array -- though they can never locate equal index tuples
  (they can only locate the next group of index tuples containing the next
  set of non-sentinel values that the scan's arrays need to advance to).

  A skip array's range is constrained by "contradictory" inequality keys.
  For example, a skip array on "x" will only generate the values 1 and 2
  given a qual such as "WHERE x BETWEEN 1 AND 2 AND y = 66".  Such a skip
  array qual usually has near-identical performance characteristics to a
  comparable SAOP qual "WHERE x = ANY('{1, 2}') AND y = 66".  However,
  improved performance isn't guaranteed.  Much depends on physical index
  characteristics.

  B-Tree preprocessing is optimistic about skipping working out: it applies
  static, generic rules when determining where to generate skip arrays,
  which assumes that the runtime overhead of maintaining skip arrays will
  pay for itself -- or lead to only a modest performance loss.  As things
  stand, these assumptions are much too optimistic: skip array maintenance
  will lead to unacceptable regressions with unsympathetic queries (queries
  whose scan can't skip over many irrelevant leaf pages).  An upcoming
  commit will address the problems in this area by enhancing _bt_readpage's
  approach to saving cycles on scan key evaluation, making it work in a way
  that directly considers the needs of = array keys (particularly = skip
  array keys).

  Author: Peter Geoghegan <pg@bowt.ie>
  Reviewed-By: Masahiro Ikeda <masahiro.ikeda@nttdata.com>
  Reviewed-By: Heikki Linnakangas <heikki.linnakangas@iki.fi>
  Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com>
  Reviewed-By: Tomas Vondra <tomas@vondra.me>
  Reviewed-By: Aleksander Alekseev <aleksander@timescale.com>
  Reviewed-By: Alena Rybakina <a.rybakina@postgrespro.ru>
  Discussion: https://postgr.es/m/CAH2-Wzmn1YsLzOGgjAQZdn1STSG_y8qP__vggTaPAYXJP+G4bw@mail.gmail.com
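  A minimal hypothetical example of the scan shape skip scan targets: a
  multi-column index whose low-cardinality leading column has no "="
  condition in the qual (table, index, and column names are assumptions):

      -- no "=" condition on "region"; preprocessing adds a skip array,
      -- as if the qual were:
      --   region = ANY(<every region>) AND ts >= '2025-01-01'
      CREATE INDEX sales_region_ts_idx ON sales (region, ts);
      SELECT * FROM sales WHERE ts >= '2025-01-01';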
* Stabilize regression test from c0962a113. (Tom Lane, 2025-04-04)

  Per buildfarm.

  Co-authored-by: Alena Rybakina <a.rybakina@postgrespro.ru>
  Co-authored-by: Tom Lane <tgl@sss.pgh.pa.us>
  Discussion: https://postgr.es/m/srnuqlttuimzmvoulhsrbgvj4vnul6b65osswvua7sfkqsvmuy@yg7apybpxp34
* Re-pgindent pg_largeobject.c after commit 0d6c477664. (Nathan Bossart, 2025-04-04)
* Convert 'x IN (VALUES ...)' to 'x = ANY ...' when appropriate (Alexander Korotkov, 2025-04-04)

  This commit implements the automatic conversion of 'x IN (VALUES ...)'
  into ScalarArrayOpExpr.  That simplifies the query tree, eliminating the
  appearance of an unnecessary join.

  Since VALUES describes a relational table, and the value of such a list is
  a table row, the optimizer will likely face an underestimation problem due
  to the inability to estimate cardinality through MCV statistics.  The
  cardinality evaluation mechanism can work with the array inclusion check
  operation.  If the array is small enough (< 100 elements), it will perform
  a statistical evaluation element by element.

  We perform the transformation in convert_ANY_sublink_to_join() if the
  VALUES RTE is proper and the transformation is convertible.  The
  conversion is only possible for operations on scalar values, not rows.
  Also, we currently support the transformation only when it ends up with a
  constant array.  Otherwise, the evaluation of non-hashed SAOP might be
  slower than the corresponding Hash Join with VALUES.

  Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com
  Author: Alena Rybakina <a.rybakina@postgrespro.ru>
  Author: Andrei Lepikhov <lepihov@gmail.com>
  Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com>
  Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
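  A sketch of the rewrite (illustrative table and values only):

      -- written form: previously planned as a join against a VALUES RTE
      SELECT * FROM t WHERE x IN (VALUES (1), (2), (3));
      -- now planned as the equivalent ScalarArrayOpExpr form:
      SELECT * FROM t WHERE x = ANY (ARRAY[1, 2, 3]);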
* Extract make_SAOP_expr() function from match_orclause_to_indexcol() (Alexander Korotkov, 2025-04-04)

  This commit extracts the code to generate ScalarArrayOpExpr on top of the
  list of expressions from match_orclause_to_indexcol() into a separate
  function make_SAOP_expr().  This function was extracted to be used in the
  optimization converting 'x IN (VALUES ...)' to 'x = ANY ...'.
  make_SAOP_expr() is placed in the clauses.c file, as only two additional
  headers were needed there compared with other places.

  Discussion: https://postgr.es/m/0184212d-1248-4f1f-a42d-f5cb1c1976d2%40tantorlabs.com
  Author: Alena Rybakina <a.rybakina@postgrespro.ru>
  Author: Andrei Lepikhov <lepihov@gmail.com>
  Reviewed-by: Ivan Kush <ivan.kush@tantorlabs.com>
  Reviewed-by: Alexander Korotkov <aekorotkov@gmail.com>
* Fix crash/valgrind error (Peter Eisentraut, 2025-04-04)

  Fix for commit 9ef1851685b: We have to skip indexes where sortopfamily is
  NULL.  This takes the place of the previous btree check.  Detected by
  valgrind on the buildfarm.
* Relax assertion in finding correct GiST parent (Heikki Linnakangas, 2025-04-04)

  Commit 28d3c2ddcf introduced an assertion that if the memorized downlink
  location in the insertion stack isn't valid, the parent's LSN should've
  changed too.  Turns out that was too strict.  In gistFindCorrectParent(),
  if we walk right, we update the parent's block number and clear its
  memorized 'downlinkoffnum'.  That triggered the assertion on next call to
  gistFindCorrectParent(), if the parent needed to be split too.

  Relax the assertion, so that it's OK if downlinkOffnum is
  InvalidOffsetNumber.

  Backpatch to v13-, all supported versions.  The assertion was added in
  commit 28d3c2ddcf in v12.

  Reported-by: Alexander Lakhin <exclusion@gmail.com>
  Reviewed-by: Tender Wang <tndrwang@gmail.com>
  Discussion: https://www.postgresql.org/message-id/18396-03cac9beb2f7aac3@postgresql.org