aboutsummaryrefslogtreecommitdiff
path: root/src/backend/utils/misc
Commit message (Collapse)AuthorAge
* Allow compute_query_id to be set to 'auto' and make it defaultAlvaro Herrera2021-05-15
| | | | | | | | | | | | | | | | | Allowing only on/off meant that all either all existing configuration guides would become obsolete if we disabled it by default, or that we would have to accept a performance loss in the default config if we enabled it by default. By allowing 'auto' as a middle ground, the performance cost is only paid by those who enable pg_stat_statements and similar modules. I only edited the release notes to comment-out a paragraph that is now factually wrong; further edits are probably needed to describe the related change in more detail. Author: Julien Rouhaud <rjuju123@gmail.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20210513002623.eugftm4nk2lvvks3@nol
* Initial pgindent and pgperltidy run for v14.Tom Lane2021-05-12
| | | | | | | | Also "make reformat-dat-files". The only change worthy of note is that pgindent messed up the formatting of launcher.c's struct LogicalRepWorkerId, which led me to notice that that struct wasn't used at all anymore, so I just took it out.
* Revert recovery prefetching feature.Thomas Munro2021-05-10
| | | | | | | | | | | | | | | | | | This set of commits has some bugs with known fixes, but at this late stage in the release cycle it seems best to revert and resubmit next time, along with some new automated test coverage for this whole area. Commits reverted: dc88460c: Doc: Review for "Optionally prefetch referenced data in recovery." 1d257577: Optionally prefetch referenced data in recovery. f003d9f8: Add circular WAL decoding buffer. 323cbe7c: Remove read_page callback from XLogReader. Remove the new GUC group WAL_RECOVERY recently added by a55a9847, as the corresponding section of config.sgml is now reverted. Discussion: https://postgr.es/m/CAOuzzgrn7iKnFRsB4MHp3UisEQAGgZMbk_ViTN4HV4-Ksq8zCg%40mail.gmail.com
* Sync guc.c and postgresql.conf.sample with the SGML docs.Tom Lane2021-05-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems that various people have moved GUCs around in the config.sgml listing without bothering to make the code agree. Ensure that the config_group codes assigned to GUCs match where they are listed in config.sgml. Likewise ensure that postgresql.conf.sample lists GUCs in the same sub-section and same ordering as they appear in config.sgml. (I've got some doubts about some of these choices, but for the purposes of this patch, we'll treat config.sgml as gospel.) Notably, this requires adding a WAL_RECOVERY config_group value, because 1d257577e didn't. As long as we're renumbering that enum anyway, let's take out the values corresponding to major groups that are divided into sub-groups. No GUC should be assigned to the major group itself, so those values just create a temptation to do the wrong thing, while adding work for translators. In passing, adjust the short_desc strings for PRESET_OPTIONS GUCs to uniformly use the phrasing "Shows XYZ.", removing the impression some of these strings left that you can set the value. While some of these errors are old, no back-patch, as changing the contents of the pg_settings view in stable branches seems more likely to be seen as a compatibility break than anything helpful. Bharath Rupireddy, Justin Pryzby, Tom Lane Discussion: https://postgr.es/m/16997-ff16127f6e0d1390@postgresql.org Discussion: https://postgr.es/m/20210413123139.GE6091@telsasoft.com
* GUC description improvements for clarityPeter Eisentraut2021-05-05
|
* Update query_id computationBruce Momjian2021-05-03
| | | | | | | | | | | | | Properly fix: - the "ONLY" in FROM [ONLY] isn't hashed - the agglevelsup field in GROUPING isn't hashed - WITH TIES not being hashed (new in PG 13) - "DISTINCT" in "GROUP BY [DISTINCT]" isn't hashed (new in PG 14) Reported-by: Julien Rouhaud Discussion: https://postgr.es/m/20210425081119.ulyzxqz23ueh3wuj@nol
* Fix incorrect format placeholderPeter Eisentraut2021-04-23
|
* adjust query id feature to use pg_stat_activity.query_idBruce Momjian2021-04-20
| | | | | | | | | | | Previously, it was pg_stat_activity.queryid to match the pg_stat_statements queryid column. This is an adjustment to patch 4f0b0966c8. This also adjusts some of the internal function calls to match. Catversion bumped. Reported-by: Álvaro Herrera, Julien Rouhaud Discussion: https://postgr.es/m/20210408032704.GA7498@alvherre.pgsql
* doc: Move force_parallel_mode to section for developer optionsMichael Paquier2021-04-14
| | | | | | | | | | | | | | | | | | | | This GUC has always been classified as a planner option since its introduction in 7c944bd, and was listed in postgresql.conf.sample. As this parameter exists for testing purposes, move it to the section dedicated to developer parameters and hence remove it from postgresql.conf.sample. This will avoid any temptation to play with it on production servers for users that should never really have to touch this parameter. The general description used for developer options is reworded a bit, to take into account the inclusion of force_parallel_mode, per a suggestion from Tom Lane. Per discussion between Tom Lane, Bruce Momjian, Justin Pryzby, Bharath Rupireddy and me. Author: Justin Pryzby, Tom Lane Discussion: https://postgr.es/m/20210403152402.GA8049@momjian.us
* Fix some inappropriately-disallowed uses of ALTER ROLE/DATABASE SET.Tom Lane2021-04-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most GUC check hooks that inspect database state have special checks that prevent them from throwing hard errors for state-dependent issues when source == PGC_S_TEST. This allows, for example, "ALTER DATABASE d SET default_text_search_config = foo" when the "foo" configuration hasn't been created yet. Without this, we have problems during dump/reload or pg_upgrade, because pg_dump has no idea about possible dependencies of GUC values and can't ensure a safe restore ordering. However, check_role() and check_session_authorization() hadn't gotten the memo about that, and would throw hard errors anyway. It's not entirely clear what is the use-case for "ALTER ROLE x SET role = y", but we've now heard two independent complaints about that bollixing an upgrade, so apparently some people are doing it. Hence, fix these two functions to act more like other check hooks with similar needs. (But I did not change their insistence on being inside a transaction, as it's still not apparent that setting either GUC from the configuration file would be wise.) Also fix check_temp_buffers, which had a different form of the disease of making state-dependent checks without any exception for PGC_S_TEST. A cursory survey of other GUC check hooks did not find any more issues of this ilk. (There are a lot of interdependencies among PGC_POSTMASTER and PGC_SIGHUP GUCs, which may be a bad idea, but they're not relevant to the immediate concern because they can't be set via ALTER ROLE/DATABASE.) Per reports from Charlie Hornsby and Nathan Bossart. Back-patch to all supported branches. Discussion: https://postgr.es/m/HE1P189MB0523B31598B0C772C908088DB7709@HE1P189MB0523.EURP189.PROD.OUTLOOK.COM Discussion: https://postgr.es/m/20160711223641.1426.86096@wrigleys.postgresql.org
* Move log_autovacuum_min_duration into its correct sectionsMichael Paquier2021-04-12
| | | | | | | | | This GUC has already been classified as LOGGING_WHAT, but its location in postgresql.conf.sample and the documentation did not reflect that, so fix those inconsistencies. Author: Justin Pryzby Discussion: https://postgr.es/m/20210404012546.GK6592@telsasoft.com
* Doc: update documentation of check_function_bodies.Tom Lane2021-04-10
| | | | | | | | | | Adjust docs and description string to note that check_function_bodies applies to procedures too. (In hindsight it should have been named check_routine_bodies, but it seems too late for that now.) Daniel Westermann Discussion: https://postgr.es/m/GV0P278MB04834A9EB9A74B036DC7CE49D2739@GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM
* Make new GUC short descriptions more consistent.Thomas Munro2021-04-10
| | | | | Reported-by: Daniel Westermann (DWE) <daniel.westermann@dbi-services.com> Discussion: https://postgr.es/m/GV0P278MB0483490FEAC879DCA5ED583DD2739%40GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM
* Doc: Review for "Optionally prefetch referenced data in recovery."Thomas Munro2021-04-10
| | | | | | | | Typos, corrections and language improvements in the docs, and a few in code comments too. Reported-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/20210409033703.GP6592%40telsasoft.com
* Fix typoMagnus Hagander2021-04-09
| | | | | | Author: Daniel Westermann Backpatch-through: 9.6 Discussion: https://postgr.es/m/GV0P278MB0483A7AA85BAFCC06D90F453D2739@GV0P278MB0483.CHEP278.PROD.OUTLOOK.COM
* Optionally prefetch referenced data in recovery.Thomas Munro2021-04-08
| | | | | | | | | | | | | | | | | | | | | | | | Introduce a new GUC recovery_prefetch, disabled by default. When enabled, look ahead in the WAL and try to initiate asynchronous reading of referenced data blocks that are not yet cached in our buffer pool. For now, this is done with posix_fadvise(), which has several caveats. Better mechanisms will follow in later work on the I/O subsystem. The GUC maintenance_io_concurrency is used to limit the number of concurrent I/Os we allow ourselves to initiate, based on pessimistic heuristics used to infer that I/Os have begun and completed. The GUC wal_decode_buffer_size is used to limit the maximum distance we are prepared to read ahead in the WAL to find uncached blocks. Reviewed-by: Alvaro Herrera <alvherre@2ndquadrant.com> (parts) Reviewed-by: Andres Freund <andres@anarazel.de> (parts) Reviewed-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> (parts) Tested-by: Tomas Vondra <tomas.vondra@2ndquadrant.com> Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com> Tested-by: Dmitry Dolgov <9erthalion6@gmail.com> Tested-by: Sait Talha Nisanci <Sait.Nisanci@microsoft.com> Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com
* Add wraparound failsafe to VACUUM.Peter Geoghegan2021-04-07
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a failsafe mechanism that is triggered by VACUUM when it notices that the table's relfrozenxid and/or relminmxid are dangerously far in the past. VACUUM checks the age of the table dynamically, at regular intervals. When the failsafe triggers, VACUUM takes extraordinary measures to finish as quickly as possible so that relfrozenxid and/or relminmxid can be advanced. VACUUM will stop applying any cost-based delay that may be in effect. VACUUM will also bypass any further index vacuuming and heap vacuuming -- it only completes whatever remaining pruning and freezing is required. Bypassing index/heap vacuuming is enabled by commit 8523492d, which made it possible to dynamically trigger the mechanism already used within VACUUM when it is run with INDEX_CLEANUP off. It is expected that the failsafe will almost always trigger within an autovacuum to prevent wraparound, long after the autovacuum began. However, the failsafe mechanism can trigger in any VACUUM operation. Even in a non-aggressive VACUUM, where we're likely to not advance relfrozenxid, it still seems like a good idea to finish off remaining pruning and freezing. An aggressive/anti-wraparound VACUUM will be launched immediately afterwards. Note that the anti-wraparound VACUUM that follows will itself trigger the failsafe, usually before it even begins its first (and only) pass over the heap. The failsafe is controlled by two new GUCs: vacuum_failsafe_age, and vacuum_multixact_failsafe_age. There are no equivalent reloptions, since that isn't expected to be useful. The GUCs have rather high defaults (both default to 1.6 billion), and are expected to generally only be used to make the failsafe trigger sooner/more frequently. Author: Masahiko Sawada <sawada.mshk@gmail.com> Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAD21AoD0SkE11fMw4jD4RENAwBMcw1wasVnwpJVw3tVqPOQgAw@mail.gmail.com Discussion: https://postgr.es/m/CAH2-WzmgH3ySGYeC-m-eOBsa2=sDwa292-CFghV4rESYo39FsQ@mail.gmail.com
* Make use of in-core query id added by commit 5fd9dfa5f5Bruce Momjian2021-04-07
| | | | | | | | | | | | | | | | | | | | | | | Use the in-core query id computation for pg_stat_activity, log_line_prefix, and EXPLAIN VERBOSE. Similar to other fields in pg_stat_activity, only the queryid from the top level statements are exposed, and if the backends status isn't active then the queryid from the last executed statements is displayed. Add a %Q placeholder to include the queryid in log_line_prefix, which will also only expose top level statements. For EXPLAIN VERBOSE, if a query identifier has been computed, either by enabling compute_query_id or using a third-party module, display it. Bump catalog version. Discussion: https://postgr.es/m/20210407125726.tkvjdbw76hxnpwfi@nol Author: Julien Rouhaud Reviewed-by: Alvaro Herrera, Nitin Jadhav, Zhihong Yu
* Move pg_stat_statements query jumbling to core.Bruce Momjian2021-04-07
| | | | | | | | | | | | | | | | | | | Add compute_query_id GUC to control whether a query identifier should be computed by the core (off by default). It's thefore now possible to disable core queryid computation and use pg_stat_statements with a different algorithm to compute the query identifier by using a third-party module. To ensure that a single source of query identifier can be used and is well defined, modules that calculate a query identifier should throw an error if compute_query_id specified to compute a query id and if a query idenfitier was already calculated. Discussion: https://postgr.es/m/20210407125726.tkvjdbw76hxnpwfi@nol Author: Julien Rouhaud Reviewed-by: Alvaro Herrera, Nitin Jadhav, Zhihong Yu
* Tighten up allowed names for custom GUC parameters.Tom Lane2021-04-07
| | | | | | | | | | | | | | | | | | | | | Formerly we were pretty lax about what a custom GUC's name could be; so long as it had at least one dot in it, we'd take it. However, corner cases such as dashes or equal signs in the name would cause various bits of functionality to misbehave. Rather than trying to make the world perfectly safe for that, let's just require that custom names look like "identifier.identifier", where "identifier" means something that scan.l would accept without double quotes. Along the way, this patch refactors things slightly in guc.c so that find_option() is responsible for reporting GUC-not-found cases, allowing removal of duplicative code from its callers. Per report from Hubert Depesz Lubaczewski. No back-patch, since the consequences of the problem don't seem to warrant changing behavior in stable branches. Discussion: https://postgr.es/m/951335.1612910077@sss.pgh.pa.us
* Split backend status and progress related functionality out of pgstat.c.Andres Freund2021-04-03
| | | | | | | | | | | | | | | | | | Backend status (supporting pg_stat_activity) and command progress (supporting pg_stat_progress*) related code is largely independent from the rest of pgstat.[ch] (supporting views like pg_stat_all_tables that accumulate data over time). See also a333476b925. This commit doesn't rename the function names to make the distinction from the rest of pgstat_ clearer - that'd be more invasive and not clearly beneficial. If we were to decide to do such a rename at some point, it's better done separately from moving the code as well. Robert's review was of an earlier version. Reviewed-By: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/20210316195440.twxmlov24rr2nxrg@alap3.anarazel.de
* Detect POLLHUP/POLLRDHUP while running queries.Thomas Munro2021-04-03
| | | | | | | | | | | | | | | | | | | | | | | | | | Provide a new GUC check_client_connection_interval that can be used to check whether the client connection has gone away, while running very long queries. It is disabled by default. For now this uses a non-standard Linux extension (also adopted by at least one other OS). POLLRDHUP is not defined by POSIX, and other OSes don't have a reliable way to know if a connection was closed without actually trying to read or write. In future we might consider trying to send a no-op/heartbeat message instead, but that could require protocol changes. Author: Sergey Cherkashin <s.cherkashin@postgrespro.ru> Author: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Thomas Munro <thomas.munro@gmail.com> Reviewed-by: Tatsuo Ishii <ishii@sraoss.co.jp> Reviewed-by: Konstantin Knizhnik <k.knizhnik@postgrespro.ru> Reviewed-by: Zhihong Yu <zyu@yugabyte.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Maksim Milyutin <milyutinma@gmail.com> Reviewed-by: Tsunakawa, Takayuki/綱川 貴之 <tsunakawa.takay@fujitsu.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (much earlier version) Discussion: https://postgr.es/m/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
* Add Result Cache executor node (take 2)David Rowley2021-04-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, result cache uses a hash table in order to quickly find tuples that have been previously cached. For certain data sets, this can significantly improve the performance of joins. The best cases for using this new node type are for join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, hash join would have to hash values that are never looked up, thus bloating the hash table and possibly causing it to multi-batch. Merge joins would have to skip over all of the unmatched rows. If we use a nested loop join with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits of using a parameterized nested loop with a result cache increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes to lookup the cache can be much faster than the hash probe in a hash join as it's common that the result cache's hash table is much smaller than the hash join's due to result cache only caching useful tuples rather than all tuples from the inner side of the join. This variation in hash probe performance is more significant when the hash join's hash table no longer fits into the CPU's L3 cache, but the result cache's hash table does. The apparent "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables. Smaller hash tables generally perform better. The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache and when we're adding new tuples and realize we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache. For parameterized nested loop joins, we now consider using one of these result cache nodes in between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven off of what the expected cache hit ratio will be. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters. For now, the planner will only consider using a result cache for parameterized nested loop joins. This works for both normal joins and also for LATERAL type joins to subqueries. It is possible to use this new node for other uses in the future. For example, to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimation on the outer plan to calculate the estimated cache hit ratio. Currently we plan the inner plan before planning the outer plan so there is no good way to know if a result cache would be useful or not since we can't estimate the number of times the subplan will be called until the outer plan is generated. The functionality being added here is newly introducing a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and we expect 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, then this will significantly reduce the planner's cost for the join. However, it's fairly easy to see here that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If this happens then that may cause us to make use of a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did previous to having this feature. Distinct estimations are also fairly hard to estimate accurately when several tables have been joined already or when a WHERE clause filters out a set of values that are correlated to the expressions we're estimating the number of distinct value for. For now, the costing we perform during query planning for result caches does put quite a bit of faith in the distinct estimations being accurate. When these are accurate then we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costings to put less trust in the distinct estimations being accurate or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner to do new tricks that it decides to use that new trick at the wrong time and causes a regression. Users may opt to get the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll maintain that setting for the release. Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names but no other name seemed to dominate in the brief discussion that there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, then we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name. Author: David Rowley Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie Tested-By: Konstantin Knizhnik Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
* Rename Default Roles to Predefined RolesStephen Frost2021-04-01
| | | | | | | | | | | | | The term 'default roles' wasn't quite apt as these roles aren't able to be modified or removed after installation, so rename them to be 'Predefined Roles' instead, adding an entry into the newly added Obsolete Appendix to help users of current releases find the new documentation. Bruce Momjian and Stephen Frost Discussion: https://postgr.es/m/157742545062.1149.11052653770497832538%40wrigleys.postgresql.org and https://www.postgresql.org/message-id/20201120211304.GG16415@tamriel.snowman.net
* Revert b6002a796David Rowley2021-04-01
| | | | | | | | | | | | | This removes "Add Result Cache executor node". It seems that something weird is going on with the tracking of cache hits and misses as highlighted by many buildfarm animals. It's not yet clear what the problem is as other parts of the plan indicate that the cache did work correctly, it's just the hits and misses that were being reported as 0. This is especially a bad time to have the buildfarm so broken, so reverting before too many more animals go red. Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
* Add Result Cache executor nodeDavid Rowley2021-04-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, result cache uses a hash table in order to quickly find tuples that have been previously cached. For certain data sets, this can significantly improve the performance of joins. The best cases for using this new node type are for join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, hash join would have to hash values that are never looked up, thus bloating the hash table and possibly causing it to multi-batch. Merge joins would have to skip over all of the unmatched rows. If we use a nested loop join with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits of using a parameterized nested loop with a result cache increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes to lookup the cache can be much faster than the hash probe in a hash join as it's common that the result cache's hash table is much smaller than the hash join's due to result cache only caching useful tuples rather than all tuples from the inner side of the join. This variation in hash probe performance is more significant when the hash join's hash table no longer fits into the CPU's L3 cache, but the result cache's hash table does. The apparent "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables. Smaller hash tables generally perform better. The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache and when we're adding new tuples and realize we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache. For parameterized nested loop joins, we now consider using one of these result cache nodes in between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven off of what the expected cache hit ratio will be. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters. For now, the planner will only consider using a result cache for parameterized nested loop joins. This works for both normal joins and also for LATERAL type joins to subqueries. It is possible to use this new node for other uses in the future. For example, to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimation on the outer plan to calculate the estimated cache hit ratio. Currently we plan the inner plan before planning the outer plan so there is no good way to know if a result cache would be useful or not since we can't estimate the number of times the subplan will be called until the outer plan is generated. The functionality being added here is newly introducing a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and we expect 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, then this will significantly reduce the planner's cost for the join. However, it's fairly easy to see here that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If this happens then that may cause us to make use of a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did previous to having this feature. Distinct estimations are also fairly hard to estimate accurately when several tables have been joined already or when a WHERE clause filters out a set of values that are correlated to the expressions we're estimating the number of distinct value for. For now, the costing we perform during query planning for result caches does put quite a bit of faith in the distinct estimations being accurate. When these are accurate then we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costings to put less trust in the distinct estimations being accurate or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner to do new tricks that it decides to use that new trick at the wrong time and causes a regression. Users may opt to get the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll maintain that setting for the release. Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names but no other name seemed to dominate in the brief discussion that there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, then we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name. Author: David Rowley Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu Tested-By: Konstantin Knizhnik Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
* Add support for asynchronous execution.Etsuro Fujita2021-03-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This implements asynchronous execution, which runs multiple parts of a non-parallel-aware Append concurrently rather than serially to improve performance when possible. Currently, the only node type that can be run concurrently is a ForeignScan that is an immediate child of such an Append. In the case where such ForeignScans access data on different remote servers, this would run those ForeignScans concurrently, and overlap the remote operations to be performed simultaneously, so it'll improve the performance especially when the operations involve time-consuming ones such as remote join and remote aggregation. We may extend this to other node types such as joins or aggregates over ForeignScans in the future. This also adds the support for postgres_fdw, which is enabled by the table-level/server-level option "async_capable". The default is false. Robert Haas, Kyotaro Horiguchi, Thomas Munro, and myself. This commit is mostly based on the patch proposed by Robert Haas, but also uses stuff from the patch proposed by Kyotaro Horiguchi and from the patch proposed by Thomas Munro. Reviewed by Kyotaro Horiguchi, Konstantin Knizhnik, Andrey Lepikhov, Movead Li, Thomas Munro, Justin Pryzby, and others. Discussion: https://postgr.es/m/CA%2BTgmoaXQEt4tZ03FtQhnzeDEMzBck%2BLrni0UWHVVgOTnA6C1w%40mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGLBRyu0rHrDCMC4%3DRn3252gogyp1SjOgG8SEKKZv%3DFwfQ%40mail.gmail.com Discussion: https://postgr.es/m/20200228.170650.667613673625155850.horikyota.ntt%40gmail.com
* Change checkpoint_completion_target default to 0.9Stephen Frost2021-03-24
| | | | | | | | | | | | | | | | | | | Common recommendations are that the checkpoint should be spread out as much as possible, provided we avoid having it take too long. This change updates the default to 0.9 (from 0.5) to match that recommendation. There was some debate about possibly removing the option entirely but it seems there may be some corner-cases where having it set much lower to try to force the checkpoint to be as fast as possible could result in fewer periods of time of reduced performance due to kernel flushing. General agreement is that the "spread more" is the preferred approach though and those who need to tune away from that value are much less common. Reviewed-By: Michael Paquier, Peter Eisentraut, Tom Lane, David Steele, Nathan Bossart Discussion: https://postgr.es/m/20201207175329.GM16415%40tamriel.snowman.net
* Tidy up more loose ends related to configurable TOAST compression.Robert Haas2021-03-24
| | | | | | | | | | | | | | | | | | | | | | | | | | | Change the default_toast_compression GUC to be an enum rather than a string. Earlier, uncommitted versions of the patch supported using CREATE ACCESS METHOD to add new compression methods to a running system, but that idea was dropped before commit. So, we can simplify the GUC handling as well, which has the nice side effect of improving the error messages. While updating the documentation to reflect the new GUC type, also move it back to the right place in the list. I moved this while revising what became commit 24f0e395ac5892cd12e8914646fe921fac5ba23d, but apparently the intended ordering is "alphabetical" rather than "whatever Robert thinks looks nice." Rejigger things to avoid having access/toast_compression.h depend on utils/guc.h, so that we don't end up with every file that includes it also depending on something largely unrelated. Move a few inline functions back into the C source file partly to help reduce dependencies and partly just to avoid clutter. A few very minor cosmetic fixes. Original patch by Justin Pryzby, but very heavily edited by me, and reverse reviewed by him and also reviewed by by Tom Lane. Discussion: http://postgr.es/m/CA+TgmoYp=GT_ztUCeZg2i4hkHAQv8o=-nVJ1-TKWTG1zQOmOpg@mail.gmail.com
* Revert "Enable parallel SELECT for "INSERT INTO ... SELECT ..."."Amit Kapila2021-03-24
| | | | | | | | | | | | | | | | | | | To allow inserts in parallel-mode this feature has to ensure that all the constraints, triggers, etc. are parallel-safe for the partition hierarchy which is costly and we need to find a better way to do that. Additionally, we could have used existing cached information in some cases like indexes, domains, etc. to determine the parallel-safety. List of commits reverted, in reverse chronological order: ed62d3737c Doc: Update description for parallel insert reloption. c8f78b6161 Add a new GUC and a reloption to enable inserts in parallel-mode. c5be48f092 Improve FK trigger parallel-safety check added by 05c8482f7f. e2cda3c20a Fix use of relcache TriggerDesc field introduced by commit 05c8482f7f. e4e87a32cc Fix valgrind issue in commit 05c8482f7f. 05c8482f7f Enable parallel SELECT for "INSERT INTO ... SELECT ...". Discussion: https://postgr.es/m/E1lMiB9-0001c3-SY@gemulon.postgresql.org
* Avoid leaking memory in RestoreGUCState(), and improve comments.Tom Lane2021-03-19
| | | | | | | | | | | | | | | | | | | | | | | | RestoreGUCState applied InitializeOneGUCOption to already-live GUC entries, causing any malloc'd subsidiary data to be forgotten. We do want the effect of resetting the GUC to its compiled-in default, and InitializeOneGUCOption seems like the best way to do that, so add code to free any existing subsidiary data beforehand. The interaction between can_skip_gucvar, SerializeGUCState, and RestoreGUCState is way more subtle than their opaque comments would suggest to an unwary reader. Rewrite and enlarge the comments to try to make it clearer what's happening. Remove a long-obsolete assertion in read_nondefault_variables: the behavior of set_config_option hasn't depended on IsInitProcessingMode since f5d9698a8 installed a better way of controlling it. Although this is fixing a clear memory leak, the leak is quite unlikely to involve any large amount of data, and it can only happen once in the lifetime of a worker process. So it seems unnecessary to take any risk of back-patching. Discussion: https://postgr.es/m/4105247.1616174862@sss.pgh.pa.us
* Provide recovery_init_sync_method=syncfs.Thomas Munro2021-03-20
| | | | | | | | | | | | | | | | | | | | | | | | | Since commit 2ce439f3 we have opened every file in the data directory and called fsync() at the start of crash recovery. This can be very slow if there are many files, leading to field complaints of systems taking minutes or even hours to begin crash recovery. Provide an alternative method, for Linux only, where we call syncfs() on every possibly different filesystem under the data directory. This is equivalent, but avoids faulting in potentially many inodes from potentially slow storage. The new mode comes with some caveats, described in the documentation, so the default value for the new setting is "fsync", preserving the older behavior. Reported-by: Michael Brown <michael.brown@discourse.org> Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> Reviewed-by: Paul Guo <guopa@vmware.com> Reviewed-by: Bruce Momjian <bruce@momjian.us> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: David Steele <david@pgmasters.net> Discussion: https://postgr.es/m/11bc2bb7-ecb5-3ad0-b39f-df632734cd81%40discourse.org Discussion: https://postgr.es/m/CAEET0ZHGnbXmi8yF3ywsDZvb3m9CbdsGZgfTXscQ6agcbzcZAw%40mail.gmail.com
* Allow configurable LZ4 TOAST compression.Robert Haas2021-03-19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is now a per-column COMPRESSION option which can be set to pglz (the default, and the only option in up until now) or lz4. Or, if you like, you can set the new default_toast_compression GUC to lz4, and then that will be the default for new table columns for which no value is specified. We don't have lz4 support in the PostgreSQL code, so to use lz4 compression, PostgreSQL must be built --with-lz4. In general, TOAST compression means compression of individual column values, not the whole tuple, and those values can either be compressed inline within the tuple or compressed and then stored externally in the TOAST table, so those properties also apply to this feature. Prior to this commit, a TOAST pointer has two unused bits as part of the va_extsize field, and a compessed datum has two unused bits as part of the va_rawsize field. These bits are unused because the length of a varlena is limited to 1GB; we now use them to indicate the compression type that was used. This means we only have bit space for 2 more built-in compresison types, but we could work around that problem, if necessary, by introducing a new vartag_external value for any further types we end up wanting to add. Hopefully, it won't be too important to offer a wide selection of algorithms here, since each one we add not only takes more coding but also adds a build dependency for every packager. Nevertheless, it seems worth doing at least this much, because LZ4 gets better compression than PGLZ with less CPU usage. It's possible for LZ4-compressed datums to leak into composite type values stored on disk, just as it is for PGLZ. It's also possible for LZ4-compressed attributes to be copied into a different table via SQL commands such as CREATE TABLE AS or INSERT .. SELECT. It would be expensive to force such values to be decompressed, so PostgreSQL has never done so. For the same reasons, we also don't force recompression of already-compressed values even if the target table prefers a different compression method than was used for the source data. These architectural decisions are perhaps arguable but revisiting them is well beyond the scope of what seemed possible to do as part of this project. However, it's relatively cheap to recompress as part of VACUUM FULL or CLUSTER, so this commit adjusts those commands to do so, if the configured compression method of the table happens not to match what was used for some column value stored therein. Dilip Kumar. The original patches on which this work was based were written by Ildus Kurbangaliev, and those were patches were based on even earlier work by Nikita Glukhov, but the design has since changed very substantially, since allow a potentially large number of compression methods that could be added and dropped on a running system proved too problematic given some of the architectural issues mentioned above; the choice of which specific compression method to add first is now different; and a lot of the code has been heavily refactored. More recently, Justin Przyby helped quite a bit with testing and reviewing and this version also includes some code contributions from him. Other design input and review from Tomas Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander Korotkov, and me. Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com
* Don't leak malloc'd strings when a GUC setting is rejected.Tom Lane2021-03-18
| | | | | | | | | | | | | | Because guc.c prefers to keep all its string values in malloc'd not palloc'd storage, it has to be more careful than usual to avoid leaks. Error exits out of string GUC hook checks failed to clear the proposed value string, and error exits out of ProcessGUCArray() failed to clear the malloc'd results of ParseLongOption(). Found via valgrind testing. This problem is ancient, so back-patch to all supported branches. Discussion: https://postgr.es/m/3816764.1616104288@sss.pgh.pa.us
* Remove temporary files after backend crashTomas Vondra2021-03-18
| | | | | | | | | | | | | | | | | | After a crash of a backend using temporary files, the files used to be left behind, on the basis that it might be useful for debugging. But we don't have any reports of anyone actually doing that, and it means the disk usage may grow over time due to repeated backend failures (possibly even hitting ENOSPC). So this behavior is a bit unfortunate, and fixing it required either manual cleanup (deleting files, which is error-prone) or restart of the instance (i.e. service disruption). This implements automatic cleanup of temporary files, controled by a new GUC remove_temp_files_after_crash. By default the files are removed, but it can be disabled to restore the old behavior if needed. Author: Euler Taveira Reviewed-by: Tomas Vondra, Michael Paquier, Anastasia Lubennikova, Thomas Munro Discussion: https://postgr.es/m/CAH503wDKdYzyq7U-QJqGn%3DGm6XmoK%2B6_6xTJ-Yn5WSvoHLY1Ww%40mail.gmail.com
* Add a new GUC and a reloption to enable inserts in parallel-mode.Amit Kapila2021-03-18
| | | | | | | | | | | | | | | | | | | | | Commit 05c8482f7f added the implementation of parallel SELECT for "INSERT INTO ... SELECT ..." which may incur non-negligible overhead in the additional parallel-safety checks that it performs, even when, in the end, those checks determine that parallelism can't be used. This is normally only ever a problem in the case of when the target table has a large number of partitions. A new GUC option "enable_parallel_insert" is added, to allow insert in parallel-mode. The default is on. In addition to the GUC option, the user may want a mechanism to allow inserts in parallel-mode with finer granularity at table level. The new table option "parallel_insert_enabled" allows this. The default is true. Author: "Hou, Zhijie" Reviewed-by: Greg Nancarrow, Amit Langote, Takayuki Tsunakawa, Amit Kapila Discussion: https://postgr.es/m/CAA4eK1K-cW7svLC2D7DHoGHxdAdg3P37BLgebqBOC2ZLc9a6QQ%40mail.gmail.com Discussion: https://postgr.es/m/CAJcOf-cXnB5cnMKqWEp2E2z7Mvcd04iLVmV=qpFJrR3AcrTS3g@mail.gmail.com
* Don't consider newly inserted tuples in nbtree VACUUM.Peter Geoghegan2021-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the entire idea of "stale stats" within nbtree VACUUM (stop caring about stats involving the number of inserted tuples). Also remove the vacuum_cleanup_index_scale_factor GUC/param on the master branch (though just disable them on postgres 13). The vacuum_cleanup_index_scale_factor/stats interface made the nbtree AM partially responsible for deciding when pg_class.reltuples stats needed to be updated. This seems contrary to the spirit of the index AM API, though -- it is not actually necessary for an index AM's bulk delete and cleanup callbacks to provide accurate stats when it happens to be inconvenient. The core code owns that. (Index AMs have the authority to perform or not perform certain kinds of deferred cleanup based on their own considerations, such as page deletion and recycling, but that has little to do with pg_class.reltuples/num_index_tuples.) This issue was fairly harmless until the introduction of the autovacuum_vacuum_insert_threshold feature by commit b07642db, which had an undesirable interaction with the vacuum_cleanup_index_scale_factor mechanism: it made insert-driven autovacuums perform full index scans, even though there is no real benefit to doing so. This has been tied to a regression with an append-only insert benchmark [1]. Also have remaining cases that perform a full scan of an index during a cleanup-only nbtree VACUUM indicate that the final tuple count is only an estimate. This prevents vacuumlazy.c from setting the index's pg_class.reltuples in those cases (it will now only update pg_class when vacuumlazy.c had TIDs for nbtree to bulk delete). This arguably fixes an oversight in deduplication-related bugfix commit 48e12913. [1] https://smalldatum.blogspot.com/2021/01/insert-benchmark-postgres-is-still.html Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/CAD21AoA4WHthN5uU6+WScZ7+J_RcEjmcuH94qcoUPuB42ShXzg@mail.gmail.com Backpatch: 13-, where autovacuum_vacuum_insert_threshold was added.
* Track total amounts of times spent writing and syncing WAL data to disk.Fujii Masao2021-03-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds new GUC track_wal_io_timing. When this is enabled, the total amounts of time XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are counted in pg_stat_wal. This information would be useful to check how much WAL write and sync affect the performance. Enabling track_wal_io_timing will make the server query the operating system for the current time every time WAL is written or synced, which may cause significant overhead on some platforms. To avoid such additional overhead in the server with track_io_timing enabled, this commit introduces track_wal_io_timing as a separate parameter from track_io_timing. Note that WAL write and sync activity by walreceiver has not been tracked yet. This commit makes the server also track the numbers of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk, in pg_stat_wal, regardless of the setting of track_wal_io_timing. This counters can be used to calculate the WAL write and sync time per request, for example. Bump PGSTAT_FILE_FORMAT_ID. Bump catalog version. Author: Masahiro Ikeda Reviewed-By: Japin Li, Hayato Kuroda, Masahiko Sawada, David Johnston, Fujii Masao Discussion: https://postgr.es/m/0509ad67b585a5b86a83d445dfa75392@oss.nttdata.com
* Remove server and libpq support for old FE/BE protocol version 2.Heikki Linnakangas2021-03-04
| | | | | | | | | | | | | | | | | Protocol version 3 was introduced in PostgreSQL 7.4. There shouldn't be many clients or servers left out there without version 3 support. But as a courtesy, I kept just enough of the old protocol support that we can still send the "unsupported protocol version" error in v2 format, so that old clients can display the message properly. Likewise, libpq still understands v2 ErrorResponse messages when establishing a connection. The impetus to do this now is that I'm working on a patch to COPY FROM, to always prefetch some data. We cannot do that safely with the old protocol, because it requires parsing the input one byte at a time to detect the end-of-copy marker. Reviewed-by: Tom Lane, Alvaro Herrera, John Naylor Discussion: https://www.postgresql.org/message-id/9ec25819-0a8a-d51a-17dc-4150bb3cca3b%40iki.fi
* Some copy-editing of GUC descriptionsPeter Eisentraut2021-03-03
|
* Mark default_transaction_read_only as GUC_REPORT.Tom Lane2021-03-02
| | | | | | | | | | | | | | This allows clients to find out the setting at connection time without having to expend a query round trip to do so; which is helpful when trying to identify read/write servers. (One must also look at in_hot_standby, but that's already GUC_REPORT, cf bf8a662c9.) Modifying libpq to make use of this will come soon, but I felt it cleaner to push the server change separately. Haribabu Kommi, Greg Nancarrow, Vignesh C; reviewed at various times by Laurenz Albe, Takayuki Tsunakawa, Peter Smith. Discussion: https://postgr.es/m/CAF3+xM+8-ztOkaV9gHiJ3wfgENTq97QcjXQt+rbFQ6F7oNzt9A@mail.gmail.com
* VACUUM: ignore indexing operations with CONCURRENTLYAlvaro Herrera2021-02-23
| | | | | | | | | | | | | | | | | | As envisioned in commit c98763bf51bf, it is possible for VACUUM to ignore certain transactions that are executing CREATE INDEX CONCURRENTLY and REINDEX CONCURRENTLY for the purposes of computing Xmin; that's because we know those transactions are not going to examine any other tables, and are not going to execute anything else in the same transaction. (Only operations on "safe" indexes can be ignored: those on indexes that are neither partial nor expressional). This is extremely useful in cases where CIC/RC can run for a very long time, because that used to be a significant headache for concurrent vacuuming of other tables. Reviewed-by: Matthias van de Meent <boekewurm+postgres@gmail.com> Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com> Discussion: https://postgr.es/m/20210115133858.GA18931@alvherre.pgsql
* Remove outdated reference to RAID spindles.Thomas Munro2021-02-22
| | | | | | | | | | | Commit b09ff536 left behind some outdated advice in the long_desc field of the GUC "effective_io_concurrency". Remove it. Back-patch to 13. Reported-by: Andrew Gierth <andrew@tao11.riddles.org.uk> Reviewed-by: Julien Rouhaud <rjuju123@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGJyyWqFBxL9gEj-qtjBThGjhAOBE8GBnF8MUJOJ3vrfag%40mail.gmail.com
* Allow specifying CRL directoryPeter Eisentraut2021-02-18
| | | | | | | | | | | | Add another method to specify CRLs, hashed directory method, for both server and client side. This offers a means for server or libpq to load only CRLs that are required to verify a certificate. The CRL directory is specifed by separate GUC variables or connection options ssl_crl_dir and sslcrldir, alongside the existing ssl_crl_file and sslcrl, so both methods can be used at the same time. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/20200731.173911.904649928639357911.horikyota.ntt@gmail.com
* Default to wal_sync_method=fdatasync on FreeBSD.Thomas Munro2021-02-15
| | | | | | | | | | | | FreeBSD 13 gained O_DSYNC, which would normally cause wal_sync_method to choose open_datasync as its default value. That may not be a good choice for all systems, and performs worse than fdatasync in some scenarios. Let's preserve the existing default behavior for now. Like commit 576477e73c4, which did the same for Linux, back-patch to all supported releases. Discussion: https://postgr.es/m/CA%2BhUKGLsAMXBQrCxCXoW-JsUYmdOL8ALYvaX%3DCrHqWxm-nWbGA%40mail.gmail.com
* Reduce the default value of vacuum_cost_page_miss.Peter Geoghegan2021-01-27
| | | | | | | | | | | | | | | | | | | | | | | | | When commit f425b605 introduced cost based vacuum delays back in 2004, the defaults reflected then-current trends in hardware, as well as certain historical limitations in PostgreSQL. There have been enormous improvements in both areas since that time. The cost limit GUC defaults finally became much more representative of current trends following commit cbccac37, which decreased autovacuum_vacuum_cost_delay's default by 10x for PostgreSQL 12 (it went from 20ms to only 2ms). The relative costs have shifted too. This should also be accounted for by the defaults. More specifically, the relative importance of avoiding dirtying pages within VACUUM has greatly increased, primarily due to main memory capacity scaling and trends in flash storage. Within Postgres itself, improvements like sequential access during index vacuuming (at least in nbtree and GiST indexes) have also been contributing factors. To reflect all this, decrease the default of vacuum_cost_page_miss to 2. Since the default of vacuum_cost_page_dirty remains 20, dirtying a page is now considered 10x "costlier" than a page miss by default. Author: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzmLPFnkWT8xMjmcsm7YS3+_Qi3iRWAb2+_Bc8UhVyHfuA@mail.gmail.com
* Add GUC to log long wait times on recovery conflicts.Fujii Masao2021-01-08
| | | | | | | | | | | | | | | | This commit adds GUC log_recovery_conflict_waits that controls whether a log message is produced when the startup process is waiting longer than deadlock_timeout for recovery conflicts. This is useful in determining if recovery conflicts prevent the recovery from applying WAL. Note that currently a log message is produced only when recovery conflict has not been resolved yet even after deadlock_timeout passes, i.e., only when the startup process is still waiting for recovery conflict even after deadlock_timeout. Author: Bertrand Drouvot, Masahiko Sawada Reviewed-by: Alvaro Herrera, Kyotaro Horiguchi, Fujii Masao Discussion: https://postgr.es/m/9a60178c-a853-1440-2cdc-c3af916cff59@amazon.com
* Improve commentary in timeout.c.Tom Lane2021-01-06
| | | | | | | On re-reading I realized that I'd missed one race condition in the new timeout code. It's safe, but add a comment explaining it. Discussion: https://postgr.es/m/CA+hUKG+o6pbuHBJSGnud=TadsuXySWA7CCcPgCt2QE9F6_4iHQ@mail.gmail.com
* Add idle_session_timeout.Tom Lane2021-01-06
| | | | | | | | | | | | This GUC variable works much like idle_in_transaction_session_timeout, in that it kills sessions that have waited too long for a new client query. But it applies when we're not in a transaction, rather than when we are. Li Japin, reviewed by David Johnston and Hayato Kuroda, some fixes by me Discussion: https://postgr.es/m/763A0689-F189-459E-946F-F0EC4458980B@hotmail.com
* Improve timeout.c's handling of repeated timeout set/cancel.Tom Lane2021-01-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A very common usage pattern is that we set a timeout that we don't expect to reach, cancel it after a little bit, and later repeat. With the original implementation of timeout.c, this results in one setitimer() call per timeout set or cancel. We can do a lot better by being lazy about changing the timeout interrupt request, namely: (1) never cancel the outstanding interrupt, even when we have no active timeout events; (2) if we need to set an interrupt, but there already is one pending at or before the required time, leave it alone. When the interrupt happens, the signal handler will reschedule it at whatever time is then needed. For example, with a one-second setting for statement_timeout, this method results in having to interact with the kernel only a little more than once a second, no matter how many statements we execute in between. The mainline code might never call setitimer() at all after the first time, while each time the signal handler fires, it sees that the then-pending request is most of a second away, and that's when it sets the next interrupt request for. Each mainline timeout-set request after that will observe that the time it wants is past the pending interrupt request time, and do nothing. This also works pretty well for cases where a few different timeout lengths are in use, as long as none of them are very short. But that describes our usage well. Idea and original patch by Thomas Munro; I fixed a race condition and improved the comments. Discussion: https://postgr.es/m/CA+hUKG+o6pbuHBJSGnud=TadsuXySWA7CCcPgCt2QE9F6_4iHQ@mail.gmail.com