path: root/src
Commit message (Author, Age)
* Empty search_path in logical replication apply worker and walsender. (Noah Misch, 2020-08-10)
  This is like CVE-2018-1058 commit 582edc369cdbd348d68441fc50fa26a84afd0c1a. Today, a malicious user of a publisher or subscriber database can invoke arbitrary SQL functions under an identity running replication, often a superuser. This fix may cause "does not exist" or "no schema has been selected to create in" errors in a replication process. After upgrading, consider watching server logs for these errors. Objects accruing schema qualification in the wake of the earlier commit are unlikely to need further correction. Back-patch to v10, which introduced logical replication.

  Security: CVE-2020-14349
* Move connect.h from fe_utils to src/include/common. (Noah Misch, 2020-08-10)
  Any libpq client can use the header. Clients include backend components postgres_fdw, dblink, and logical replication apply worker. Back-patch to v10, because another fix needs this. In released branches, just copy the header and keep the original.
* Make contrib modules' installation scripts more secure. (Tom Lane, 2020-08-10)
  Hostile objects located within the installation-time search_path could capture references in an extension's installation or upgrade script. If the extension is being installed with superuser privileges, this opens the door to privilege escalation. While such hazards have existed all along, their urgency increases with the v13 "trusted extensions" feature, because that lets a non-superuser control the installation path for a superuser-privileged script. Therefore, make a number of changes to make such situations more secure:

    * Tweak the construction of the installation-time search_path to ensure that references to objects in pg_catalog can't be subverted; and explicitly add pg_temp to the end of the path to prevent attacks using temporary objects.
    * Disable check_function_bodies within installation/upgrade scripts, so that any security gaps in SQL-language or PL-language function bodies cannot create a risk of unwanted installation-time code execution.
    * Adjust lookup of type input/receive functions and join estimator functions to complain if there are multiple candidate functions. This prevents capture of references to functions whose signature is not the first one checked; and it's arguably more user-friendly anyway.
    * Modify various contrib upgrade scripts to ensure that catalog modification queries are executed with secure search paths. (These are in-place modifications with no extension version changes, since it is the update process itself that is at issue, not the end result.)

  Extensions that depend on other extensions cannot be made fully secure by these methods alone; therefore, revert the "trusted" marking that commit eb67623c9 applied to earthdistance and hstore_plperl, pending some better solution to that set of issues.

  Also add documentation around these issues, to help extension authors write secure installation scripts.

  Patch by me, following an observation by Andres Freund; thanks to Noah Misch for review.

  Security: CVE-2020-14350
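The search_path hardening described in this commit can be pictured with a toy model of name lookup. This is only a sketch, not PostgreSQL code: it assumes the documented rule that pg_temp and pg_catalog are searched implicitly, ahead of the listed schemas, whenever the path does not name them (for relation and type names). The `effective_path` and `resolve` helpers and the sample catalog are hypothetical.

```python
def effective_path(search_path):
    """Toy lookup order for relation/type names (assumed rule, see lead-in)."""
    path = list(search_path)
    if "pg_catalog" not in path:
        path.insert(0, "pg_catalog")  # implicitly searched before listed schemas
    if "pg_temp" not in path:
        path.insert(0, "pg_temp")     # implicitly searched first of all
    return path


def resolve(name, search_path, catalog):
    """Return the first schema in lookup order that contains `name`."""
    for schema in effective_path(search_path):
        if name in catalog.get(schema, ()):
            return schema
    return None


# Hypothetical catalog: hostile objects named "text" shadow the built-in type.
catalog = {
    "pg_catalog": {"text"},
    "pg_temp": {"text"},
    "public": {"text"},
}

naive = ["public"]                           # pg_temp is searched implicitly first
hardened = ["pg_catalog", "public", "pg_temp"]  # pg_catalog first, pg_temp last

print(resolve("text", naive, catalog))
print(resolve("text", hardened, catalog))
```

With the naive path the temporary object wins; with pg_catalog listed first and pg_temp listed last, the built-in object is found. The exact path the real installation code builds is not shown in the commit message, so the `hardened` list is an illustration of the stated goal only.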
* Correct nbtree page split lock coupling comment. (Peter Geoghegan, 2020-08-09)
  There is no reason to distinguish between readers and writers here.
* Check for fseeko() failure in pg_dump's _tarAddFile(). (Tom Lane, 2020-08-09)
  Coverity pointed out, not unreasonably, that we checked fseeko's result at every other call site but these. Failure to seek in the temp file (note this is NOT pg_dump's output file) seems quite unlikely, and even if it did happen the file length cross-check further down would probably detect the problem. Still, that's a poor excuse for not checking the result of a system call.
* Remove useless Assert. (Tom Lane, 2020-08-09)
  Testing that an unsigned variable is >= 0 is pretty pointless, as noted by Coverity and numerous buildfarm members.

  In passing, add comment about new uses of "volatile" --- Coverity doesn't much like that either, but it seems probably necessary.
* walsnd: Don't set waiting_for_ping_response spuriously (Alvaro Herrera, 2020-08-08)
  Ashutosh Bapat noticed that when logical walsender needs to wait for WAL, and it realizes that it must send a keepalive message to walreceiver to update the sent-LSN, which *does not* request a reply from walreceiver, it wrongly sets the flag that it's going to wait for that reply. That means that any future would-be sender of feedback messages ends up not sending a feedback message, because they all believe that a reply is expected.

  With built-in logical replication there's not much harm in this, because WalReceiverMain will send a ping-back every wal_receiver_timeout/2 anyway; but with other logical replication systems (e.g. pglogical) it can cause significant pain.

  This problem was introduced in commit 41d5f8ad734, where the request-reply flag passed to WalSndKeepalive was changed from true to false, without at the same time removing the line that sets waiting_for_ping_response. Just removing that line would be a sufficient fix, but it seems better to shift the responsibility of setting the flag to WalSndKeepalive itself instead of requiring the caller to do it; this is clearly less error-prone.

  Author: Álvaro Herrera <alvherre@alvh.no-ip.org>
  Reported-by: Ashutosh Bapat <ashutosh.bapat@2ndquadrant.com>
  Backpatch: 9.5 and up
  Discussion: https://postgr.es/m/20200806225558.GA22401@alvherre.pgsql
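The design choice in this fix, letting the keepalive routine own the flag rather than each caller, can be sketched as follows. This is a hypothetical model, not the walsender code: `WalSenderSketch` and its fields are illustrative names, and the "message" is just a tuple appended to a list.

```python
class WalSenderSketch:
    """Toy model: the keepalive routine, not its callers, sets the wait flag."""

    def __init__(self):
        self.waiting_for_ping_response = False
        self.sent = []  # stand-in for messages written to the connection

    def keepalive(self, request_reply):
        # Record the keepalive and whether it asks the receiver to reply.
        self.sent.append(("keepalive", request_reply))
        # Only expect a reply if one was actually requested.
        if request_reply:
            self.waiting_for_ping_response = True


sender = WalSenderSketch()
sender.keepalive(request_reply=False)  # sent-LSN update only
flag_after_plain_keepalive = sender.waiting_for_ping_response
sender.keepalive(request_reply=True)   # a real ping
flag_after_ping = sender.waiting_for_ping_response
```

Because the flag is set in exactly one place and depends on the argument, a caller that changes `request_reply` cannot leave the flag out of sync, which is the failure mode the commit message describes.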
* Add some const decorations (Peter Eisentraut, 2020-08-08)
* Implement streaming mode in ReorderBuffer. (Amit Kapila, 2020-08-08)
  Instead of serializing the transaction to disk after reaching the logical_decoding_work_mem limit in memory, we consume the changes we have in memory and invoke stream API methods added by commit 45fdc9738b. However, sometimes if we have incomplete toast or speculative insert we spill to the disk because we can't generate the complete tuple and stream. And, as soon as we get the complete tuple we stream the transaction including the serialized changes.

  We can do this incremental processing thanks to having assignments (associating subxact with toplevel xacts) in WAL right away, and thanks to logging the invalidation messages at each command end. These features are added by commits 0bead9af48 and c55040ccd0 respectively.

  Now that we can stream in-progress transactions, the concurrent aborts may cause failures when the output plugin consults catalogs (both system and user-defined).

  We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to the backend or WALSender decoding a specific uncommitted transaction. The decoding logic on the receipt of such a sqlerrcode aborts the decoding of the current transaction and continues with the decoding of other transactions.

  We have ReorderBufferTXN pointer in each ReorderBufferChange by which we know which xact it belongs to. The output plugin can use this to decide which changes to discard in case of stream_abort_cb (e.g. when a subxact gets discarded).

  We also provide a new option via SQL APIs to fetch the changes being streamed.

  Author: Dilip Kumar, Tomas Vondra, Amit Kapila, Nikhil Sontakke
  Reviewed-by: Amit Kapila, Kuntal Ghosh, Ajin Cherian
  Tested-by: Neha Sharma, Mahendra Singh Thalor and Ajin Cherian
  Discussion: https://postgr.es/m/688b0b7f-2f6c-d827-c27b-216a8e3ea700@2ndquadrant.com
* Make nbtree split REDO locking match original execution. (Peter Geoghegan, 2020-08-07)
  Make the nbtree page split REDO routine consistent with original execution in its approach to acquiring and releasing buffer locks (at least for pages on the tree level of the page being split). This brings btree_xlog_split() in line with btree_xlog_unlink_page(), which was taught to couple buffer locks by commit 9a9db08a.

  Note that the precise order in which we both acquire and release sibling buffer locks in btree_xlog_split() now matches original execution exactly (the precise order in which the locks are released probably doesn't matter much, but we might as well be consistent about it).

  The rule for nbtree REDO routines from here on is that same-level locks should be acquired in an order that's consistent with original execution. It's not practical to have a similar rule for cross-level page locks, since for the most part original execution holds those locks for a period that spans multiple atomic actions/WAL records. It's also not necessary, because clearly the cross-level lock coupling is only truly needed during original execution because of the presence of concurrent inserters.

  This is not a bug fix (unlike the similar aforementioned commit, commit 9a9db08a). The immediate reason to tighten things up in this area is to enable an upcoming enhancement to contrib/amcheck that allows it to verify that sibling links are in agreement with only an AccessShareLock (this check produced false positives when run on a replica server on account of the inconsistency fixed by this commit). But that's not the only reason to be stricter here. It is generally useful to make locking on replicas be as close to what happens during original execution as practically possible. It makes it less likely that hard to catch bugs will slip in in the future. The previous state of affairs seems to be a holdover from before the introduction of Hot Standby, when buffer lock acquisitions during recovery were totally unnecessary. See also: commit 3bbf668d, which tightened things up in this area a few years after the introduction of Hot Standby.

  Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com
* Remove PROC_IN_ANALYZE and derived flags (Alvaro Herrera, 2020-08-07)
  These flags are unused and always have been.

  Discussion: https://postgr.es/m/20200805235549.GA8118@alvherre.pgsql
* Support testing of cases where table schemas change after planning. (Tom Lane, 2020-08-07)
  We have various cases where we allow DDL on tables to be performed with less than full AccessExclusiveLock. This requires concurrent queries to be able to cope with the DDL change mid-flight, but up to now we had no repeatable way to test such cases. To improve that, invent a test module that allows halting a backend after planning and then resuming execution once we've done desired actions in another session. (The same approach could be used to inject delays in other places, if there's a suitable hook available.)

  This commit includes a single test case, which is meant to exercise the previously-untestable ExecCreatePartitionPruneState code repaired by commit 7a980dfc6. We'd probably not bother with this if that were the only foreseen benefit, but I expect additional test cases will use this infrastructure in the future.

  Test module by Andy Fan, partition-addition test case by me.

  Discussion: https://postgr.es/m/20200802181131.GA27754@telsasoft.com
* Rename nbtree split REDO routine variables. (Peter Geoghegan, 2020-08-07)
  Make the nbtree page split REDO routine variable names consistent with _bt_split() (which handles the original execution of page splits). These names make the code easier to follow by making the distinction between the original page and the left half of the split clear. (The left half of the split page is a temp page that REDO creates to replace the origpage contents.)

  Also reduce the elevel used when adding a new high key to the temp page from PANIC to ERROR to be consistent. We already raise only an ERROR when PageAddItem() calls for data items on the temp page fail.
* Fix yet another issue with step generation in partition pruning. (Etsuro Fujita, 2020-08-07)
  Commit 13838740f fixed some issues with step generation in partition pruning, but there was yet another one: get_steps_using_prefix() assumes that clauses in the passed-in prefix list are sorted in ascending order of their partition key numbers, but the caller failed to ensure this for range partitioning, which led to an assertion failure in debug builds. Adjust the caller function to arrange the clauses in the prefix list in the required order for range partitioning.

  Back-patch to v11, like the previous commit.

  Patch by me, reviewed by Amit Langote.

  Discussion: https://postgr.es/m/CAPmGK16jkXiFG0YqMbU66wte-oJTfW6D1HaNvQf%3D%2B5o9%3Dm55wQ%40mail.gmail.com
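The ordering requirement placed on the prefix list can be illustrated with a small sketch. It is not the planner's code: `PrefixClause` and `order_prefix` are hypothetical names, and clauses are reduced to a key number plus a text expression. The only point shown is that the caller sorts by partition key number before handing the list on.

```python
from typing import NamedTuple


class PrefixClause(NamedTuple):
    keyno: int  # position of the partition key column this clause constrains
    expr: str   # textual stand-in for the clause


def order_prefix(clauses):
    """Return the clauses in ascending partition-key order (stable sort)."""
    return sorted(clauses, key=lambda clause: clause.keyno)


# Clauses collected in arbitrary order for a three-column range key.
collected = [
    PrefixClause(1, "b = 2"),
    PrefixClause(0, "a = 1"),
    PrefixClause(1, "b <= 5"),
]
ordered = order_prefix(collected)
```

Python's `sorted` is stable, so clauses on the same key keep their relative order; whether the real code relies on that property is not stated in the commit message.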
* Fix bogus EXPLAIN output for Hash Aggregate (David Rowley, 2020-08-07)
  9bdb300de modified the EXPLAIN output for Hash Aggregate to show details from parallel workers. However, it neglected to consider that a given parallel worker may not have assisted with the given Hash Aggregate. This can occur when workers fail to start or during Parallel Append with enable_partitionwise_join enabled when only a single worker is working on a non-parallel aware sub-plan. It could also happen if a worker simply wasn't fast enough to get any work done before other processes went and finished all the work.

  The bogus output came from the fact that ExplainOpenWorker() skipped showing any details for non-initialized workers but show_hashagg_info() did show details from the worker. This meant that the worker properties that were shown were not properly attributed to the worker that they belong to.

  In passing, we also now don't show Hash Aggregate properties for the leader process when it did not contribute any work to the Hash Aggregate. This can occur either during Parallel Append when only a parallel worker worked on a given sub plan or with parallel_leader_participation set to off. This aims to make the behavior of Hash Aggregate's EXPLAIN output more similar to Sort's.

  Reported-by: Justin Pryzby
  Discussion: https://postgr.es/m/20200805012105.GZ28072%40telsasoft.com
  Backpatch-through: 13, where the original breakage was introduced
* Register llvm_shutdown using on_proc_exit, not before_shmem_exit. (Robert Haas, 2020-08-06)
  This seems more correct, because other before_shmem_exit calls may expect the infrastructure that is needed to run queries and access the database to be working, and also because this cleanup has nothing to do with shared memory.

  There are no known user-visible consequences to this, though, apart from what was previously fixed by commit 303640199d0436c5e7acdf50b837a027b5726594 and back-patched as commit bcbc27251d35336a6442761f59638138a772b839 and commit f7013683d9bb663a6a917421b1374306a32f165b, so for now, no back-patch.

  Bharath Rupireddy

  Discussion: http://postgr.es/m/CALj2ACWk7j4F2v2fxxYfrroOF=AdFNPr1WsV+AGtHAFQOqm_pw@mail.gmail.com
* Fix matching of sub-partitions when a partitioned plan is stale. (Tom Lane, 2020-08-05)
  Since we no longer require AccessExclusiveLock to add a partition, the executor may see that a partitioned table has more partitions than the planner saw. ExecCreatePartitionPruneState's code for matching up the partition lists in such cases was faulty, and would misbehave if the planner had successfully pruned any partitions from the query. (Thus, trouble would occur only if a partition addition happens concurrently with a query that uses both static and dynamic partition pruning.) This led to an Assert failure in debug builds, and probably to crashes or query misbehavior in production builds.

  To repair the bug, just explicitly skip zeroes in the plan's relid_map[] list. I also made some cosmetic changes to make the code more readable (IMO anyway). Also, convert the cross-checking Assert to a regular test-and-elog, since it's now apparent that this logic is more fragile than one would like.

  Currently, there's no way to repeatably exercise this code, except with manual use of a debugger to stop the backend between planning and execution. Hence, no test case in this patch. We oughta do something about that testability gap, but that's for another day.

  Amit Langote and Tom Lane, per report from Justin Pryzby. Oversight in commit 898e5e329; backpatch to v12 where that appeared.

  Discussion: https://postgr.es/m/20200802181131.GA27754@telsasoft.com
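The repair described here, skipping zero entries in the plan's relid_map[] while walking the executor's larger partition list, can be sketched as a two-pointer merge. This is a simplified illustration and not the executor code: `match_partitions` is a hypothetical helper, OIDs are plain integers, 0 stands for a partition the planner pruned, and -1 marks "no subplan".

```python
def match_partitions(plan_relid_map, plan_subplan_map, exec_oids):
    """Map each partition the executor sees to a planned subplan index or -1."""
    result = []
    pd_idx = 0  # position in the planner's (possibly stale) arrays
    nplan = len(plan_relid_map)
    for oid in exec_oids:
        # Skip entries the planner pruned; they carry no OID to match against.
        while pd_idx < nplan and plan_relid_map[pd_idx] == 0:
            pd_idx += 1
        if pd_idx < nplan and plan_relid_map[pd_idx] == oid:
            result.append(plan_subplan_map[pd_idx])
            pd_idx += 1
        else:
            # Partition added after planning, or one the planner pruned.
            result.append(-1)
    # Cross-check as a real error rather than an assertion.
    if any(relid != 0 for relid in plan_relid_map[pd_idx:]):
        raise RuntimeError("could not match partition child tables to plan elements")
    return result


# Planner saw partitions 100, 200, 300 and pruned 200.
relid_map = [100, 0, 300]
subplan_map = [0, -1, 1]
# Executor sees a partition (150) that was added concurrently.
mapping = match_partitions(relid_map, subplan_map, [100, 150, 200, 300])
print(mapping)  # [0, -1, -1, 1]
```

Without the skip loop, the zero at position 1 would never match any executor OID and the walk would stall there, which is in the spirit of the misbehavior the commit message reports.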
* Remove btree page items after page unlink (Alexander Korotkov, 2020-08-05)
  Currently, page unlink leaves remaining items "as is", but replay of the corresponding WAL record re-initializes the page, leaving it with no items. For the sake of consistency, this commit makes the primary delete all the items during page unlink as well.

  Thanks to this change, we now don't mask contents of deleted btree page for WAL consistency checking.

  Discussion: https://postgr.es/m/CAPpHfdt_OTyQpXaPJcWzV2N-LNeNJseNB-K_A66qG%3DL518VTFw%40mail.gmail.com
  Author: Alexander Korotkov
  Reviewed-by: Peter Geoghegan
* Increase hard-wired timeout values in ecpg regression tests. (Tom Lane, 2020-08-04)
  A couple of test cases had connect_timeout=14, a value that seems to have been plucked from a hat. While it's more than sufficient for normal cases, slow/overloaded buildfarm machines can get a timeout failure here, as per recent report from "sungazer". Increase to 180 seconds, which is in line with our typical timeouts elsewhere in the regression tests.

  Back-patch to 9.6; the code looks different in 9.5, and this doesn't seem to be quite worth the effort to adapt to that.

  Report: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-08-04%2007%3A12%3A22
* Make new SSL TAP test for channel_binding more robust (Michael Paquier, 2020-08-04)
  The test would fail in an environment including a certificate file in ~/.postgresql/. bdd6e9b fixed a similar failure, and d6e612f introduced the same problem again with a new test.

  Author: Kyotaro Horiguchi
  Discussion: https://postgr.es/m/20200804.120033.31225582282178001.horikyota.ntt@gmail.com
  Backpatch-through: 13
* Fix replica backward scan race condition. (Peter Geoghegan, 2020-08-03)
  It was possible for the logic used by backward scans (which must reason about concurrent page splits/deletions in its own peculiar way) to become confused when running on a replica. Concurrent replay of a WAL record that describes the second phase of page deletion could cause _bt_walk_left() to get confused. btree_xlog_unlink_page() simply failed to adhere to the same locking protocol that we use on the primary, which is obviously wrong once you consider these two disparate functions together. This bug is present in all stable branches.

  More concretely, the problem was that nothing stopped _bt_walk_left() from observing inconsistencies between the deletion's target page and its original sibling pages when running on a replica. This is true even though the second phase of page deletion is supposed to work as a single atomic action. Queries running on replicas raised "could not find left sibling of block %u in index %s" can't-happen errors when they went back to their scan's "original" page and observed that the page has not been marked deleted (even though it really was concurrently deleted).

  There is no evidence that this actually happened in the real world. The issue came to light during unrelated feature development work. Note that _bt_walk_left() is the only code that cares about the difference between a half-dead page and a fully deleted page that isn't also exclusively used by nbtree VACUUM (unless you include contrib/amcheck code). It seems very likely that backward scans are the only thing that could become confused by the inconsistency. Even amcheck's complex bt_right_page_check_scankey() dance was unaffected.

  To fix, teach btree_xlog_unlink_page() to lock the left sibling, target, and right sibling pages in that order before releasing any locks (just like _bt_unlink_halfdead_page()). This is the simplest possible approach. There doesn't seem to be any opportunity to be more clever about lock acquisition in the REDO routine, and it hardly seems worth the trouble in any case.

  This fix might enable contrib/amcheck verification of leaf page sibling links with only an AccessShareLock on the relation. An amcheck patch from Andrey Borodin was rejected back in January because it clashed with btree_xlog_unlink_page()'s lax approach to locking pages. It now seems likely that the real problem was with btree_xlog_unlink_page(), not the patch.

  This is a low severity, low likelihood bug, so no backpatch.

  Author: Michail Nikolaev
  Diagnosed-By: Michail Nikolaev
  Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
* Add nbtree page deletion assertion. (Peter Geoghegan, 2020-08-03)
  Add a documenting assertion that's similar to the nearby assertion added by commit cd8c73a3. This conveys that the entire call to _bt_pagedel() does no work if it isn't possible to get a descent stack for the initial scanblkno page.
* Remove unnecessary "DISTINCT" in psql's queries for \dAc and \dAf. (Tom Lane, 2020-08-03)
  A moment's examination of these queries is sufficient to see that they do not produce duplicate rows, unless perhaps there's catalog corruption. Using DISTINCT anyway is inefficient and confusing; moreover it sets a poor example for anyone who refers to psql -E output to see how to query the catalogs.
* Fix behavior of ecpg's "EXEC SQL elif name". (Tom Lane, 2020-08-03)
  This ought to work much like C's "#elif defined(name)"; but the code implemented it in a way equivalent to endif followed by ifdef, so that it didn't matter whether any previous branch of the IF construct had succeeded. Fix that; add some test cases covering elif and nested IFs; and improve the documentation, which also seemed a bit confused.

  AFAICS the code has been like this since the feature was added in 1999 (commit b57b0e044). So while it's surely wrong, there might be code out there relying on the current behavior. Hence, don't back-patch into stable branches. It seems all right to fix it in v13 though.

  Per report from Ashutosh Sharma. Reviewed by Ashutosh Sharma and Michael Meskes.

  Discussion: https://postgr.es/m/CAE9k0P=dQk9X0cU2tN49S7a9tv733-e1pVdpB1P-pWJ5PdTktg@mail.gmail.com
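The intended elif semantics (a branch is taken only if no earlier branch of the same construct was) can be shown with a miniature conditional preprocessor. This is a sketch only: it uses bare `ifdef NAME`, `elif NAME`, `else`, and `endif` words instead of ecpg's real `EXEC SQL ...;` syntax, and `preprocess` is a hypothetical helper, not ecpg's implementation.

```python
def preprocess(lines, defined):
    """Return the lines that survive conditional processing."""
    out = []
    # One frame per open ifdef: [parent_active, any_branch_taken, active_now]
    stack = []
    for line in lines:
        words = line.split()
        directive = words[0] if words else ""
        if directive == "ifdef":
            parent = stack[-1][2] if stack else True
            cond = words[1] in defined
            stack.append([parent, cond, parent and cond])
        elif directive == "elif":
            frame = stack[-1]
            cond = words[1] in defined
            # Active only if no earlier branch of this construct was taken.
            frame[2] = frame[0] and not frame[1] and cond
            frame[1] = frame[1] or cond
        elif directive == "else":
            frame = stack[-1]
            frame[2] = frame[0] and not frame[1]
            frame[1] = True
        elif directive == "endif":
            stack.pop()
        elif not stack or stack[-1][2]:
            out.append(line)
    return out


program = ["ifdef A", "a", "elif B", "b", "else", "c", "endif"]
both = preprocess(program, {"A", "B"})
only_b = preprocess(program, {"B"})
neither = preprocess(program, set())
```

With both A and B defined only the first branch survives. Under the old "endif followed by ifdef" reading, the B branch would have been emitted as well, because nothing remembered that A had already succeeded.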
* Add %P to log_line_prefix for parallel group leader (Michael Paquier, 2020-08-03)
  This is useful for monitoring purposes with log parsing. Similarly to pg_stat_activity, the leader's PID is shown only for active parallel workers, minimizing the log footprint for the leaders as the equivalent shared memory field is set as long as a backend is alive.

  Author: Justin Pryzby
  Reviewed-by: Álvaro Herrera, Michael Paquier, Julien Rouhaud, Tom Lane
  Discussion: https://postgr.es/m/20200315111831.GA21492@telsasoft.com
* Fix rare failure in LDAP tests. (Thomas Munro, 2020-08-03)
  Instead of writing a query to psql's stdin, use -c. This avoids a failure where psql exits before we write, seen a few times on the build farm. Thanks to Tom Lane for the suggestion.

  Back-patch to 11, where the LDAP tests arrived.

  Reviewed-by: Noah Misch <noah@leadboat.com>
  Discussion: https://postgr.es/m/CA%2BhUKGLFmW%2BHQYPeKiwSp5sdFFHtFViCpw4Mh6yAgEx74r5-Cw%40mail.gmail.com
* Correct comment in simplehash.h. (Thomas Munro, 2020-08-03)
  Post-commit review for commit 84c0e4b9.

  Author: David Rowley <dgrowleyml@gmail.com>
  Discussion: https://postgr.es/m/CAApHDvptBx_%2BUPAzY0uXzopbvPVGKPeZ6Hoy8rnPcWz20Cr0Bw%40mail.gmail.com
* Fix minor issues in psql's new \dAc and related commands. (Tom Lane, 2020-08-02)
  The type-name pattern in \dAc and \dAf was matched only to the actual pg_type.typname string, which is fairly user-unfriendly in cases where that is not what's shown to the user by format_type (compare "_int4" and "integer[]"). Make this code match what \dT does, i.e. match the pattern against either typname or format_type() output. Also fix its broken handling of schema-name restrictions. (IOW, make these processSQLNamePattern calls match \dT's.) While here, adjust whitespace to make the query a little prettier in -E output, too.

  Also improve some inaccuracies and shaky grammar in the related documentation.

  Noted while working on a patch for intarray's opclasses; I wondered why I couldn't get a match to "integer*" for the input type name.
* Use int64 instead of long in incremental sort code (David Rowley, 2020-08-02)
  64-bit Windows has 4-byte long values, which are not suitable for tracking disk space usage in the incremental sort code. Let's just make all these fields int64s.

  Author: James Coleman
  Discussion: https://postgr.es/m/CAApHDvpky%2BUhof8mryPf5i%3D6e6fib2dxHqBrhp0Qhu0NeBhLJw%40mail.gmail.com
  Backpatch-through: 13, where the incremental sort code was added
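Why a 4-byte long is a poor container for disk usage can be seen by simulating the truncation. This sketch uses Python's `ctypes` fixed-width integers as a stand-in for C types; the 3 GiB figure and the assumption that the value is a byte count are illustrative only, since the commit message does not say which units the fields use.

```python
import ctypes


def as_int32(value):
    """Value as it would appear if stored in a 32-bit signed integer."""
    return ctypes.c_int32(value).value  # ctypes truncates without checking


def as_int64(value):
    """Value as it would appear if stored in a 64-bit signed integer."""
    return ctypes.c_int64(value).value


three_gib = 3 * 1024**3
print(as_int32(three_gib))  # -1073741824
print(as_int64(three_gib))  # 3221225472
```

The 32-bit field silently reports a negative number once usage passes 2 GiB, while the 64-bit field holds the value unchanged.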
* Change XID and mxact limits to warn at 40M and stop at 3M. (Noah Misch, 2020-08-01)
  We have edge-case bugs when assigning values in the last few dozen pages before the wrap limit. We may introduce similar bugs in the future. At default BLCKSZ, this makes such bugs unreachable outside of single-user mode. Also, when VACUUM began to consume mxacts, multiStopLimit did not change to compensate.

  pg_upgrade may fail on a cluster that was already printing "must be vacuumed" warnings. Follow the warning's instructions to clear the warning, then run pg_upgrade again. One can still peacefully consume 98% of XIDs or mxacts, so DBAs need not change routine VACUUM settings.

  Discussion: https://postgr.es/m/20200621083513.GA3074645@rfd.leadboat.com
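The "98%" figure follows from simple arithmetic if one assumes the wrap limit sits about 2^31 IDs past the oldest one. The sketch below encodes that assumption; the constant names, the `xid_state` helper, and the inclusive boundaries are hypothetical and simplified (the server's real checks use modular XID comparisons relative to the oldest XID).

```python
WRAP_DISTANCE = 2**31       # assumed distance from oldest ID to the wrap limit
WARN_MARGIN = 40_000_000    # start warning this many IDs before the wrap limit
STOP_MARGIN = 3_000_000     # refuse new IDs this many before the wrap limit


def xid_state(consumed):
    """Classify how close `consumed` IDs brings us to the wrap limit."""
    remaining = WRAP_DISTANCE - consumed
    if remaining <= STOP_MARGIN:
        return "stop"
    if remaining <= WARN_MARGIN:
        return "warn"
    return "ok"


# Fraction of the ID space usable before the first warning appears.
usable_fraction = (WRAP_DISTANCE - WARN_MARGIN) / WRAP_DISTANCE
print(round(usable_fraction, 3))  # 0.981
```

Under these assumptions roughly 98.1% of the space can be consumed before any warning, which matches the commit message's claim that routine VACUUM settings need not change.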
* Invent "amadjustmembers" AM method for validating opclass members. (Tom Lane, 2020-08-01)
  This allows AM-specific knowledge to be applied during creation of pg_amop and pg_amproc entries. Specifically, the AM knows better than core code which entries to consider as required or optional. Giving the latter entries the appropriate sort of dependency allows them to be dropped without taking out the whole opclass or opfamily; which is something we'd like to have to correct obsolescent entries in extensions.

  This callback also opens the door to performing AM-specific validity checks during opclass creation, rather than hoping that an opclass developer will remember to test with "amvalidate". For the most part I've not actually added any such checks yet; that can happen in a follow-on patch. (Note that we shouldn't remove any tests from "amvalidate", as those are still needed to cross-check manually constructed entries in the initdb data. So adding tests to "amadjustmembers" will be somewhat duplicative, but it seems like a good idea anyway.)

  Patch by me, reviewed by Alexander Korotkov, Hamid Akhtar, and Anastasia Lubennikova.

  Discussion: https://postgr.es/m/4578.1565195302@sss.pgh.pa.us
* Use pg_pread() and pg_pwrite() in slru.c. (Thomas Munro, 2020-08-02)
  This avoids lseek() system calls at every SLRU I/O, as was done for relation files in commit c24dcd0c.

  Reviewed-by: Ashwin Agrawal <aagrawal@pivotal.io>
  Reviewed-by: Andres Freund <andres@anarazel.de>
  Discussion: https://postgr.es/m/CA%2BhUKG%2Biqke4uTRFj8D8uEUUgj%2BRokPSp%2BCWM6YYzaaamG9Wvg%40mail.gmail.com
  Discussion: https://postgr.es/m/CA%2BhUKGJ%2BoHhnvqjn3%3DHro7xu-YDR8FPr0FL6LF35kHRX%3D_bUzg%40mail.gmail.com
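The saving comes from replacing a seek-then-read pair with a single positioned read. The sketch below shows the two styles with Python's `os` module on a temporary file; it illustrates the system-call pattern only and is not the SLRU code. `os.pread` is available on Unix-like platforms.

```python
import os
import tempfile


def read_with_lseek(fd, offset, length):
    """Two system calls: move the file position, then read from it."""
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, length)


def read_with_pread(fd, offset, length):
    """One system call: read at an explicit offset, file position untouched."""
    return os.pread(fd, length, offset)


with tempfile.TemporaryFile() as f:
    f.write(b"0123456789abcdef")
    f.flush()
    fd = f.fileno()
    via_lseek = read_with_lseek(fd, 8, 4)
    via_pread = read_with_pread(fd, 8, 4)

print(via_lseek, via_pread)  # b'89ab' b'89ab'
```

Both calls return the same bytes; the positioned variant simply does it in one trip to the kernel and does not depend on the descriptor's current position.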
* Minimize slot creation for multi-inserts of pg_shdepend (Michael Paquier, 2020-08-01)
  When doing multiple insertions in pg_shdepend for the copy of dependencies from a template database in CREATE DATABASE, the same number of slots would have been created and used all the time. As the number of items to insert is not known in advance, this makes most of the slots created for nothing. This improves the slot handling so that slot creation only happens when needed, minimizing the overhead of the operation.

  Author: Michael Paquier
  Reviewed-by: Daniel Gustafsson
  Discussion: https://postgr.es/m/20200731024148.GB3317@paquier.xyz
* Improve programmer docs for simplehash and dynahash. (Thomas Munro, 2020-08-01)
  When reading the code it's not obvious when one should prefer dynahash over simplehash and vice versa, so, for programmer-friendliness, add comments to inform that decision.

  Show sample simplehash method signatures.

  Author: James Coleman <jtc331@gmail.com>
  Discussion: https://postgr.es/m/CAAaqYe_dOF39gAJ8rL-a3YO3Qo96MHMRQ2whFjK5ZcU6YvMQSA%40mail.gmail.com
* Fix oversight in ALTER TYPE: typmodin/typmodout must propagate to arrays. (Tom Lane, 2020-07-31)
  If a base type supports typmods, its array type does too, with the same interpretation. Hence changes in pg_type.typmodin/typmodout must be propagated to the array type.

  While here, improve AlterTypeRecurse to not recurse to domains if there is nothing we'd need to change.

  Oversight in fe30e7ebf. Back-patch to v13 where that came in.
* Fix recently-introduced performance problem in ts_headline(). (Tom Lane, 2020-07-31)
  The new hlCover() algorithm that I introduced in commit c9b0c678d turns out to potentially take O(N^2) or worse time on long documents, if there are many occurrences of individual query words but few or no substrings that actually satisfy the query. (One way to hit this behavior is with a "common_word & rare_word" type of query.) This seems unavoidable given the original goal of checking every substring of the document, so we have to back off that idea. Fortunately, it seems unlikely that anyone would really want headlines spanning all of a long document, so we can avoid the worse-than-linear behavior by imposing a maximum length of substring that we'll consider.

  For now, just hard-wire that maximum length as a multiple of max_words times max_fragments. Perhaps at some point somebody will argue for exposing it as a ts_headline parameter, but I'm hesitant to make such a feature addition in a back-patched bug fix.

  I also noted that the hlFirstIndex() function I'd added in that commit was unnecessarily stupid: it really only needs to check whether a HeadlineWordEntry's item pointer is null or not. This wouldn't make all that much difference in typical cases with queries having just a few terms, but a cycle shaved is a cycle earned.

  In addition, add a CHECK_FOR_INTERRUPTS call in TS_execute_recurse. This ensures that hlCover's loop is cancellable if it manages to take a long time, and it may protect some other TS_execute callers as well.

  Back-patch to 9.6 as the previous commit was. I also chose to add the CHECK_FOR_INTERRUPTS call to 9.5. The old hlCover() algorithm seems to avoid the O(N^2) behavior, at least on the test case I tried, but nonetheless it's not very quick on a long document.

  Per report from Stephen Frost.

  Discussion: https://postgr.es/m/20200724160535.GW12375@tamriel.snowman.net
* Fix compiler warning from Clang. (Thomas Munro, 2020-07-31)
  Per build farm.

  Discussion: https://postgr.es/m/20200731062626.GD3317%40paquier.xyz
* Preallocate some DSM space at startup. (Thomas Munro, 2020-07-31)
  Create an optional region in the main shared memory segment that can be used to acquire and release "fast" DSM segments, and can benefit from huge pages allocated at cluster startup time, if configured. Fall back to the existing mechanisms when that space is full. The size is controlled by a new GUC min_dynamic_shared_memory, defaulting to 0.

  Main region DSM segments initially contain whatever garbage the memory held last time they were used, rather than zeroes. That change revealed that DSA areas failed to initialize themselves correctly in memory that wasn't zeroed first, so fix that problem.

  Discussion: https://postgr.es/m/CA%2BhUKGLAE2QBv-WgGp%2BD9P_J-%3Dyne3zof9nfMaqq1h3EGHFXYQ%40mail.gmail.com
* Fix comment in instrument.hMichael Paquier2020-07-31
| | | | | | | | local_blks_dirtied tracks the number of local blocks dirtied, not shared ones. Author: Kirk Jamison Discussion: https://postgr.es/m/OSBPR01MB2341760686DC056DE89D2AB9EF710@OSBPR01MB2341.jpnprd01.prod.outlook.com
* Cache smgrnblocks() results in recovery.Thomas Munro2020-07-31
| | | | | | | | | | | Avoid repeatedly calling lseek(SEEK_END) during recovery by caching the size of each fork. For now, we can't use the same technique in other processes, because we lack a shared invalidation mechanism. Do this by generalizing the pre-existing caching used by FSM and VM to support all forks. Discussion: https://postgr.es/m/CAEepm%3D3SSw-Ty1DFcK%3D1rU-K6GSzYzfdD4d%2BZwapdN7dTa6%3DnQ%40mail.gmail.com
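The caching idea in this commit can be sketched roughly as follows. This is a minimal illustration, not the smgr code itself: every name here (`sketch_nblocks`, `sketch_probe_nblocks`, the fixed fork count, the invalid marker) is hypothetical, and the "probe" is a stand-in for the lseek(SEEK_END) call. The cached value is trusted only in recovery, mirroring the stated lack of a shared invalidation mechanism for other processes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NFORKS 4                     /* hypothetical number of forks */
#define SKETCH_INVALID_NBLOCKS UINT32_MAX   /* marker for "nothing cached" */

/* Counts how often the expensive size probe ran (stand-in for lseek). */
static int sketch_probe_calls = 0;

/* Per-fork cached sizes; every entry starts out invalid. */
static uint32_t sketch_cached_nblocks[SKETCH_NFORKS] = {
    SKETCH_INVALID_NBLOCKS, SKETCH_INVALID_NBLOCKS,
    SKETCH_INVALID_NBLOCKS, SKETCH_INVALID_NBLOCKS
};

/* Pretend probe: a real system would ask the kernel for the file size. */
static uint32_t
sketch_probe_nblocks(int fork)
{
    sketch_probe_calls++;
    return 100u + (uint32_t) fork;
}

/*
 * Return the fork size in blocks.  The cache is consulted and filled only
 * in recovery; outside recovery every call probes and leaves no cached value.
 */
static uint32_t
sketch_nblocks(int fork, bool in_recovery)
{
    uint32_t result;

    assert(fork >= 0 && fork < SKETCH_NFORKS);

    if (in_recovery && sketch_cached_nblocks[fork] != SKETCH_INVALID_NBLOCKS)
        return sketch_cached_nblocks[fork];

    result = sketch_probe_nblocks(fork);
    sketch_cached_nblocks[fork] = in_recovery ? result : SKETCH_INVALID_NBLOCKS;
    return result;
}
```

A real implementation would also have to update or invalidate the cached value whenever the fork is extended or truncated; that part is omitted here.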
* Use multi-inserts for pg_attribute and pg_shdependMichael Paquier2020-07-31
| | | | | | | | | | | | | | | | | | | For pg_attribute, this allows inserting a full set of attributes for a relation at once (roughly a 15% WAL reduction in extreme cases). For pg_shdepend, this reduces the work done when creating new shared dependencies from a database template. The number of slots used for the insertion is capped at 64kB of data inserted for both, depending on the number of items to insert and the length of the rows involved. More can be done for other catalogs, like pg_depend. This part requires a different approach as the number of slots to use depends also on the number of entries discarded as pinned dependencies. This is also related to the rework of dependency handling for ALTER TABLE and CREATE TABLE, mainly. Author: Daniel Gustafsson Reviewed-by: Andres Freund, Michael Paquier Discussion: https://postgr.es/m/20190213182737.mxn6hkdxwrzgxk35@alap3.anarazel.de
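The 64kB cap on slots described above might be computed along these lines. This is a sketch under stated assumptions: the macro and function names are hypothetical, and allowing at least one slot for an oversized row is an assumption of this example, not something the commit message states.

```c
#include <assert.h>
#include <stddef.h>

/* 64kB cap taken from the commit message; the macro name is hypothetical. */
#define SKETCH_MULTI_INSERT_MAX_BYTES 65536

/*
 * Number of slots to allocate for a batch: bounded both by the number of
 * items to insert and by how many rows of row_bytes fit under the cap.
 */
static size_t
sketch_multi_insert_slots(size_t nitems, size_t row_bytes)
{
    size_t cap;

    assert(row_bytes > 0);

    cap = SKETCH_MULTI_INSERT_MAX_BYTES / row_bytes;
    if (cap == 0)
        cap = 1;    /* assumption: a single oversized row still gets a slot */

    return (nitems < cap) ? nitems : cap;
}
```

With 128-byte rows, for instance, at most 512 slots are used no matter how many items are queued; smaller batches use exactly as many slots as they have items.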
* Use pg_bitutils for HyperLogLog.Jeff Davis2020-07-30
| | | | | | | | | | | Using pg_leftmost_one_pos32() yields substantial performance benefits. Backpatching to version 13 because HLL is used for HashAgg improvements in 9878b643, which was also backpatched to 13. Reviewed-by: Peter Geoghegan Discussion: https://postgr.es/m/CAH2-WzkGvDKVDo+0YvfvZ+1CE=iCi88DCOGFF3i1hTGGaxcKPw@mail.gmail.com Backpatch-through: 13
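To illustrate why a bit-scan helper helps here: HyperLogLog needs, for each hashed value, the 1-based position of the leftmost 1-bit. The sketch below contrasts a shift loop with a bit-scan version. It is not the PostgreSQL source; the function names are hypothetical, and it assumes a GCC- or Clang-compatible compiler for `__builtin_clz`.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a bit-scan helper: index (0..31) of the most
 * significant set bit.  Assumes GCC or Clang for __builtin_clz.
 */
static inline int
sketch_leftmost_one_pos32(uint32_t word)
{
    assert(word != 0);      /* __builtin_clz(0) is undefined */
    return 31 - __builtin_clz(word);
}

/*
 * Loop-based rho: 1-based position of the first 1-bit, counting from the
 * most significant bit, capped at max_bits + 1.
 */
static inline uint8_t
sketch_rho_loop(uint32_t x, uint8_t max_bits)
{
    uint8_t j = 1;

    while (j <= max_bits && !(x & 0x80000000u))
    {
        j++;
        x <<= 1;
    }
    return j;
}

/* Same result, computed with a single bit scan instead of a loop. */
static inline uint8_t
sketch_rho_bitscan(uint32_t x, uint8_t max_bits)
{
    uint8_t j;

    if (x == 0)
        return (uint8_t) (max_bits + 1);

    j = (uint8_t) (32 - sketch_leftmost_one_pos32(x));
    return (j > max_bits) ? (uint8_t) (max_bits + 1) : j;
}
```

The loop runs once per leading zero bit, while the bit-scan version typically compiles to a single instruction plus a comparison.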
* Include partitioned tables for tab completion of VACUUM in psqlMichael Paquier2020-07-30
| | | | | | | | | | The relkinds that support indexing are the same as the ones supporting VACUUM, so the code gets refactored a bit with the completion query used for CLUSTER, but there is no change for CLUSTER in this commit. Author: Justin Pryzby Reviewed-by: Fujii Masao, Michael Paquier, Masahiko Sawada Discussion: https://postgr.es/m/20200728170408.GI20393@telsasoft.com
* Introduce a WaitEventSet for the stats collector.Thomas Munro2020-07-30
| | | | | | | This avoids some epoll/kqueue system calls for every wait. Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
* Use WaitLatch() for condition variables.Thomas Munro2020-07-30
| | | | | | | | | Previously, condition_variable.c created a long lived WaitEventSet to avoid extra system calls. WaitLatch() now uses something similar internally, so there is no point in wasting an extra kernel descriptor. Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
* Use a long lived WaitEventSet for WaitLatch().Thomas Munro2020-07-30
| | | | | | | | | | | | | Create LatchWaitSet at backend startup time, and use it to implement WaitLatch(). This avoids repeated epoll/kqueue setup and teardown system calls. Reorder SubPostmasterMain() slightly so that we restore the postmaster pipe and Windows signal emulation before we reach InitPostmasterChild(), to make this work in EXEC_BACKEND builds. Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJAC4Oqao%3DqforhNey20J8CiG2R%3DoBPqvfR0vOJrFysGw%40mail.gmail.com
* Add hash_mem_multiplier GUC.Peter Geoghegan2020-07-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a GUC that acts as a multiplier on work_mem. It gets applied when sizing executor node hash tables that were previously size constrained using work_mem alone. The new GUC can be used to preferentially give hash-based nodes more memory than the generic work_mem limit. It is intended to enable admin tuning of the executor's memory usage. Overall system throughput and system responsiveness can be improved by giving hash-based executor nodes more memory (especially over sort-based alternatives, which are often much less sensitive to being memory constrained). The default value for hash_mem_multiplier is 1.0, which is also the minimum valid value. This means that hash-based nodes continue to apply work_mem in the traditional way by default. hash_mem_multiplier is generally useful. However, it is being added now due to concerns about hash aggregate performance stability for users that upgrade to Postgres 13 (which added disk-based hash aggregation in commit 1f39bce0). While the old hash aggregate behavior risked out-of-memory errors, it is nevertheless likely that many users actually benefited. Hash agg's previous indifference to work_mem during query execution was not just faster; it also accidentally made aggregation resilient to grouping estimate problems (at least in cases where this didn't create destabilizing memory pressure). hash_mem_multiplier can provide a certain kind of continuity with the behavior of Postgres 12 hash aggregates in cases where the planner incorrectly estimates that all groups (plus related allocations) will fit in work_mem/hash_mem. This seems necessary because hash-based aggregation is usually much slower when only a small fraction of all groups can fit. 
Even when it isn't possible to totally avoid hash aggregates that spill, giving hash aggregation more memory will reliably improve performance (the same cannot be said for external sort operations, which appear to be almost unaffected by memory availability provided it's at least possible to get a single merge pass). The PostgreSQL 13 release notes should advise users that increasing hash_mem_multiplier can help with performance regressions associated with hash aggregation. That can be taken care of by a later commit. Author: Peter Geoghegan Reviewed-By: Álvaro Herrera, Jeff Davis Discussion: https://postgr.es/m/20200625203629.7m6yvut7eqblgmfo@alap3.anarazel.de Discussion: https://postgr.es/m/CAH2-WzmD%2Bi1pG6rc1%2BCjc4V6EaFJ_qSuKCCHVnH%3DoruqD-zqow%40mail.gmail.com Backpatch: 13-, where disk-based hash aggregation was introduced.
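The relationship between work_mem and the new multiplier can be sketched as below. This is an illustration under assumptions, not the executor code: the function name is hypothetical, values are taken to be in kilobytes, and clamping at `INT_MAX` is this example's choice of overflow guard.

```c
#include <assert.h>
#include <limits.h>

/*
 * Effective memory limit (in kB) for a hash-based node: work_mem scaled by
 * hash_mem_multiplier.  The clamp to INT_MAX is an assumption made for this
 * sketch so that large products cannot overflow the int result.
 */
static int
sketch_hash_mem_kb(int work_mem_kb, double hash_mem_multiplier)
{
    double hash_mem;

    assert(work_mem_kb > 0);
    assert(hash_mem_multiplier >= 1.0);     /* 1.0 is the minimum valid value */

    hash_mem = (double) work_mem_kb * hash_mem_multiplier;

    if (hash_mem < (double) INT_MAX)
        return (int) hash_mem;
    return INT_MAX;
}
```

With the default multiplier of 1.0 the result equals work_mem, so hash-based nodes behave as before; a multiplier of 2.0 doubles the budget for hash tables only, leaving sort-based nodes at work_mem.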
* Remove non-fast promotion.Fujii Masao2020-07-29
| | | | | | | | | | | | | | | When fast promotion was supported in 9.3, non-fast promotion became an undocumented feature that is basically not available to ordinary users. However, we decided not to remove non-fast promotion at that point, keeping it for a release or two for debugging purposes or as an emergency method in case fast promotion had issues, and to remove it later. Several versions have been released since that decision, and there is no longer any reason to keep supporting non-fast promotion. Therefore this commit removes non-fast promotion. Author: Fujii Masao Reviewed-by: Hamid Akhtar, Kyotaro Horiguchi Discussion: https://postgr.es/m/76066434-648f-f567-437b-54853b43398f@oss.nttdata.com
* HashAgg: use better cardinality estimate for recursive spilling.Jeff Davis2020-07-28
| | | | | | | | | | | | | | | | Use HyperLogLog to estimate the group cardinality in a spilled partition. This estimate is used to choose the number of partitions if we recurse. The previous behavior was to use the number of tuples in a spilled partition as the estimate for the number of groups, which led to overpartitioning. That could cause the number of batches to be much higher than expected (with each batch being very small), which made it harder to interpret EXPLAIN ANALYZE results. Reviewed-by: Peter Geoghegan Discussion: https://postgr.es/m/a856635f9284bc36f7a77d02f47bbb6aaf7b59b3.camel@j-davis.com Backpatch-through: 13
* Fix incorrect print format in json.cMichael Paquier2020-07-29
| | | | | | | | Oid is unsigned, so %u needs to be used and not %d. The code path involved here is not normally reachable, so no backpatch is done. Author: Justin Pryzby Discussion: https://postgr.es/m/20200728015523.GA27308@telsasoft.com
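The format-specifier point can be shown with a small standalone example. The `Oid` typedef mirrors PostgreSQL's definition as an unsigned int; the helper name and its static buffer are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef unsigned int Oid;   /* Oid is an unsigned integer type */

/*
 * Format an Oid with %u.  Returns a pointer to a static buffer, so the
 * result is only valid until the next call (fine for a sketch).
 */
static const char *
sketch_oid_to_str(Oid oid)
{
    static char buf[16];
    int         n;

    n = snprintf(buf, sizeof(buf), "%u", oid);
    assert(n > 0 && (size_t) n < sizeof(buf));

    return buf;
}
```

Large Oid values such as 4294967295 come out intact with %u. Passing the same value to %d mismatches the argument type, and on common platforms it would be printed as a negative number.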