aboutsummaryrefslogtreecommitdiff
path: root/src/backend/executor/execProcnode.c
Commit message (Collapse)AuthorAge
* Update copyright for 2025Bruce Momjian2025-01-01
| | | | Backpatch-through: 13
* Evaluate arguments of correlated SubPlans in the referencing ExprStateAndres Freund2024-07-31
| | | | | | | | | | | | | | | | | | | | | | | Until now we generated an ExprState for each parameter to a SubPlan and evaluated them one-by-one ExecScanSubPlan. That's sub-optimal as creating lots of small ExprStates a) makes JIT compilation more expensive b) wastes memory c) is a bit slower to execute This commit arranges to evaluate parameters to a SubPlan as part of the ExprState referencing a SubPlan, using the new EEOP_PARAM_SET expression step. We emit one EEOP_PARAM_SET for each argument to a subplan, just before the EEOP_SUBPLAN step. It likely is worth using EEOP_PARAM_SET in other places as well, e.g. for SubPlan outputs, nestloop parameters and - more ambitiously - to get rid of ExprContext->domainValue/caseValue/ecxt_agg*. But that's for later. Author: Andres Freund <andres@anarazel.de> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Alena Rybakina <lena.ribackina@yandex.ru> Discussion: https://postgr.es/m/20230225214401.346ancgjqc3zmvek@awork3.anarazel.de
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* Remove obsolete executor cleanup codeAmit Langote2023-09-28
| | | | | | | | | | | | | | | | | | | This commit removes unnecessary ExecExprFreeContext() calls in ExecEnd* routines because the actual cleanup is managed by FreeExecutorState(). With no callers remaining for ExecExprFreeContext(), this commit also removes the function. This commit also drops redundant ExecClearTuple() calls, because ExecResetTupleTable() in ExecEndPlan() already takes care of resetting and dropping all TupleTableSlots initialized with ExecInitScanTupleSlot() and ExecInitExtraTupleSlot(). After these modifications, the ExecEnd*() routines for ValuesScan, NamedTuplestoreScan, and WorkTableScan became redundant. So, this commit removes them. Reviewed-by: Robert Haas Discussion: https://postgr.es/m/CA+HiwqFGkMSge6TgC9KQzde0ohpAycLQuV7ooitEEpbKB0O_mg@mail.gmail.com
* Update copyright for 2023Bruce Momjian2023-01-02
| | | | Backpatch-through: 11
* Future-proof the recursion inside ExecShutdownNode().Tom Lane2022-09-19
| | | | | | | | | | | | | | | | | | | | | | | The API contract for planstate_tree_walker() callbacks is that they take a PlanState pointer and a context pointer. Somebody figured they could save a couple lines of code by ignoring that, and passing ExecShutdownNode itself as the walker even though it has but one argument. Somewhat remarkably, we've gotten away with that so far. However, it seems clear that the upcoming C2x standard means to forbid such cases, and compilers that actively break such code likely won't be far behind. So spend the extra few lines of code to do it honestly with a separate walker function. In HEAD, we might as well go further and remove ExecShutdownNode's useless return value. I left that as-is in back branches though, to forestall complaints about ABI breakage. Back-patch, with the thought that this might become of practical importance before our stable branches are all out of service. It doesn't seem to be fixing any live bug on any currently known platform, however. Discussion: https://postgr.es/m/208054.1663534665@sss.pgh.pa.us
* Update copyright for 2022Bruce Momjian2022-01-07
| | | | Backpatch-through: 10
* Change the name of the Result Cache node to MemoizeDavid Rowley2021-07-14
| | | | | | | | | | | "Result Cache" was never a great name for this node, but nobody managed to come up with another name that anyone liked enough. That was until David Johnston mentioned "Node Memoization", which Tom Lane revised to just "Memoize". People seem to like "Memoize", so let's do the rename. Reviewed-by: Justin Pryzby Discussion: https://postgr.es/m/20210708165145.GG1176@momjian.us Backpatch-through: 14, where Result Cache was introduced
* Fix EXPLAIN ANALYZE for async-capable nodes.Etsuro Fujita2021-05-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | EXPLAIN ANALYZE for an async-capable ForeignScan node associated with postgres_fdw is done just by using instrumentation for ExecProcNode() called from the node's callbacks, causing the following problems: 1) If the remote table to scan is empty, the node is incorrectly considered as "never executed" by the command even if the node is executed, as ExecProcNode() isn't called from the node's callbacks at all in that case. 2) The command fails to collect timings for things other than ExecProcNode() done in the node, such as creating a cursor for the node's remote query. To fix these problems, add instrumentation for async-capable nodes, and modify postgres_fdw accordingly. My oversight in commit 27e1f1456. While at it, update a comment for the AsyncRequest struct in execnodes.h and the documentation for the ForeignAsyncRequest API in fdwhandler.sgml to match the code in ExecAsyncAppendResponse() in nodeAppend.c, and fix typos in comments in nodeAppend.c. Per report from Andrey Lepikhov, though I didn't use his patch. Reviewed-by: Andrey Lepikhov Discussion: https://postgr.es/m/2eb662bb-105d-fc20-7412-2f027cc3ca72%40postgrespro.ru
* Add Result Cache executor node (take 2)David Rowley2021-04-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, result cache uses a hash table in order to quickly find tuples that have been previously cached. For certain data sets, this can significantly improve the performance of joins. The best cases for using this new node type are for join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, hash join would have to hash values that are never looked up, thus bloating the hash table and possibly causing it to multi-batch. Merge joins would have to skip over all of the unmatched rows. If we use a nested loop join with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits of using a parameterized nested loop with a result cache increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes to lookup the cache can be much faster than the hash probe in a hash join as it's common that the result cache's hash table is much smaller than the hash join's due to result cache only caching useful tuples rather than all tuples from the inner side of the join. This variation in hash probe performance is more significant when the hash join's hash table no longer fits into the CPU's L3 cache, but the result cache's hash table does. The apparent "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables. Smaller hash tables generally perform better. The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache and when we're adding new tuples and realize we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache. For parameterized nested loop joins, we now consider using one of these result cache nodes in between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven off of what the expected cache hit ratio will be. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters. For now, the planner will only consider using a result cache for parameterized nested loop joins. This works for both normal joins and also for LATERAL type joins to subqueries. It is possible to use this new node for other uses in the future. For example, to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimation on the outer plan to calculate the estimated cache hit ratio. Currently we plan the inner plan before planning the outer plan so there is no good way to know if a result cache would be useful or not since we can't estimate the number of times the subplan will be called until the outer plan is generated. The functionality being added here is newly introducing a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and we expect 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, then this will significantly reduce the planner's cost for the join. However, it's fairly easy to see here that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If this happens then that may cause us to make use of a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did previous to having this feature. Distinct estimations are also fairly hard to estimate accurately when several tables have been joined already or when a WHERE clause filters out a set of values that are correlated to the expressions we're estimating the number of distinct value for. For now, the costing we perform during query planning for result caches does put quite a bit of faith in the distinct estimations being accurate. When these are accurate then we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costings to put less trust in the distinct estimations being accurate or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner to do new tricks that it decides to use that new trick at the wrong time and causes a regression. Users may opt to get the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll maintain that setting for the release. Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names but no other name seemed to dominate in the brief discussion that there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, then we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name. Author: David Rowley Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie Tested-By: Konstantin Knizhnik Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
* Revert b6002a796David Rowley2021-04-01
| | | | | | | | | | | | | This removes "Add Result Cache executor node". It seems that something weird is going on with the tracking of cache hits and misses as highlighted by many buildfarm animals. It's not yet clear what the problem is as other parts of the plan indicate that the cache did work correctly, it's just the hits and misses that were being reported as 0. This is especially a bad time to have the buildfarm so broken, so reverting before too many more animals go red. Discussion: https://postgr.es/m/CAApHDvq_hydhfovm4=izgWs+C5HqEeRScjMbOgbpC-jRAeK3Yw@mail.gmail.com
* Add Result Cache executor nodeDavid Rowley2021-04-01
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, result cache uses a hash table in order to quickly find tuples that have been previously cached. For certain data sets, this can significantly improve the performance of joins. The best cases for using this new node type are for join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side of the join. In such cases, hash join would have to hash values that are never looked up, thus bloating the hash table and possibly causing it to multi-batch. Merge joins would have to skip over all of the unmatched rows. If we use a nested loop join with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits of using a parameterized nested loop with a result cache increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes to lookup the cache can be much faster than the hash probe in a hash join as it's common that the result cache's hash table is much smaller than the hash join's due to result cache only caching useful tuples rather than all tuples from the inner side of the join. This variation in hash probe performance is more significant when the hash join's hash table no longer fits into the CPU's L3 cache, but the result cache's hash table does. The apparent "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables. Smaller hash tables generally perform better. The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache and when we're adding new tuples and realize we've exceeded the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache. For parameterized nested loop joins, we now consider using one of these result cache nodes in between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven off of what the expected cache hit ratio will be. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters. For now, the planner will only consider using a result cache for parameterized nested loop joins. This works for both normal joins and also for LATERAL type joins to subqueries. It is possible to use this new node for other uses in the future. For example, to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimation on the outer plan to calculate the estimated cache hit ratio. Currently we plan the inner plan before planning the outer plan so there is no good way to know if a result cache would be useful or not since we can't estimate the number of times the subplan will be called until the outer plan is generated. The functionality being added here is newly introducing a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and we expect 1000 outer rows, then we'll estimate the hit ratio to be 99%. Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, then this will significantly reduce the planner's cost for the join. However, it's fairly easy to see here that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If this happens then that may cause us to make use of a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did previous to having this feature. Distinct estimations are also fairly hard to estimate accurately when several tables have been joined already or when a WHERE clause filters out a set of values that are correlated to the expressions we're estimating the number of distinct value for. For now, the costing we perform during query planning for result caches does put quite a bit of faith in the distinct estimations being accurate. When these are accurate then we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costings to put less trust in the distinct estimations being accurate or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner to do new tricks that it decides to use that new trick at the wrong time and causes a regression. Users may opt to get the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll maintain that setting for the release. Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names but no other name seemed to dominate in the brief discussion that there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, then we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name. Author: David Rowley Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu Tested-By: Konstantin Knizhnik Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
* Add TID Range Scans to support efficient scanning ranges of TIDsDavid Rowley2021-02-27
| | | | | | | | | | | | | | | | | | | | | This adds a new executor node named TID Range Scan. The query planner will generate paths for TID Range scans when quals are discovered on base relations which search for ranges on the table's ctid column. These ranges may be open at either end. For example, WHERE ctid >= '(10,0)'; will return all tuples on page 10 and over. To support this, two new optional callback functions have been added to table AM. scan_set_tidrange is used to set the scan range to just the given range of TIDs. scan_getnextslot_tidrange fetches the next tuple in the given range. For AMs were scanning ranges of TIDs would not make sense, these functions can be set to NULL in the TableAmRoutine. The query planner won't generate TID Range Scan Paths in that case. Author: Edmund Horner, David Rowley Reviewed-by: David Rowley, Tomas Vondra, Tom Lane, Andres Freund, Zhihong Yu Discussion: https://postgr.es/m/CAMyN-kB-nFTkF=VA_JPwFNo08S0d-Yk0F741S2B7LDmYAi8eyA@mail.gmail.com
* Update copyright for 2021Bruce Momjian2021-01-02
| | | | Backpatch-through: 9.5
* Fix buffer usage stats for nodes above Gather Merge.Amit Kapila2020-07-25
| | | | | | | | | | | | | | | Commit 85c9d347 addressed a similar problem for Gather and Gather Merge nodes but forgot to account for nodes above parallel nodes. This still works for nodes above Gather node because we shut down the workers for Gather node as soon as there are no more tuples. We can do a similar thing for Gather Merge as well but it seems better to account for stats during nodes shutdown after completing the execution. Reported-by: Stéphane Lorek, Jehan-Guillaume de Rorthais Author: Jehan-Guillaume de Rorthais <jgdr@dalibo.com> Reviewed-by: Amit Kapila Backpatch-through: 10, where it was introduced Discussion: https://postgr.es/m/20200718160206.584532a2@firost
* Implement Incremental SortTomas Vondra2020-04-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Incremental Sort is an optimized variant of multikey sort for cases when the input is already sorted by a prefix of the requested sort keys. For example when the relation is already sorted by (key1, key2) and we need to sort it by (key1, key2, key3) we can simply split the input rows into groups having equal values in (key1, key2), and only sort/compare the remaining column key3. This has a number of benefits: - Reduced memory consumption, because only a single group (determined by values in the sorted prefix) needs to be kept in memory. This may also eliminate the need to spill to disk. - Lower startup cost, because Incremental Sort produce results after each prefix group, which is beneficial for plans where startup cost matters (like for example queries with LIMIT clause). We consider both Sort and Incremental Sort, and decide based on costing. The implemented algorithm operates in two different modes: - Fetching a minimum number of tuples without check of equality on the prefix keys, and sorting on all columns when safe. - Fetching all tuples for a single prefix group and then sorting by comparing only the remaining (non-prefix) keys. We always start in the first mode, and employ a heuristic to switch into the second mode if we believe it's beneficial - the goal is to minimize the number of unnecessary comparions while keeping memory consumption below work_mem. This is a very old patch series. The idea was originally proposed by Alexander Korotkov back in 2013, and then revived in 2017. In 2018 the patch was taken over by James Coleman, who wrote and rewrote most of the current code. There were many reviewers/contributors since 2013 - I've done my best to pick the most active ones, and listed them in this commit message. Author: James Coleman, Alexander Korotkov Reviewed-by: Tomas Vondra, Andreas Karlsson, Marti Raudsepp, Peter Geoghegan, Robert Haas, Thomas Munro, Antonin Houska, Andres Freund, Alexander Kuzmenkov Discussion: https://postgr.es/m/CAPpHfdscOX5an71nHd8WSUH6GNOCf=V7wgDaTXdDd9=goN-gfA@mail.gmail.com Discussion: https://postgr.es/m/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
* Update copyrights for 2020Bruce Momjian2020-01-01
| | | | Backpatch-through: update all files in master, backpatch legal files through 9.4
* Make the order of the header file includes consistent in backend modules.Amit Kapila2019-11-12
| | | | | | | | | | | Similar to commits 7e735035f2 and dddf4cdc33, this commit makes the order of header file inclusion consistent for backend modules. In the passing, removed a couple of duplicate inclusions. Author: Vignesh C Reviewed-by: Kuntal Ghosh and Amit Kapila Discussion: https://postgr.es/m/CALDaNm2Sznv8RR6Ex-iJO6xAdsxgWhCoETkaYX=+9DW3q0QCfA@mail.gmail.com
* Use Append rather than MergeAppend for scanning ordered partitions.Tom Lane2019-04-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we need ordered output from a scan of a partitioned table, but the ordering matches the partition ordering, then we don't need to use a MergeAppend to combine the pre-ordered per-partition scan results: a plain Append will produce the same results. This both saves useless comparison work inside the MergeAppend proper, and allows us to start returning tuples after istarting up just the first child node not all of them. However, all is not peaches and cream, because if some of the child nodes have high startup costs then there will be big discontinuities in the tuples-returned-versus-elapsed-time curve. The planner's cost model cannot handle that (yet, anyway). If we model the Append's startup cost as being just the first child's startup cost, we may drastically underestimate the cost of fetching slightly more tuples than are available from the first child. Since we've had bad experiences with over-optimistic choices of "fast start" plans for ORDER BY LIMIT queries, that seems scary. As a klugy workaround, set the startup cost estimate for an ordered Append to be the sum of its children's startup costs (as MergeAppend would). This doesn't really describe reality, but it's less likely to cause a bad plan choice than an underestimated startup cost would. In practice, the cases where we really care about this optimization will have child plans that are IndexScans with zero startup cost, so that the overly conservative estimate is still just zero. David Rowley, reviewed by Julien Rouhaud and Antonin Houska Discussion: https://postgr.es/m/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com
* Update copyright for 2019Bruce Momjian2019-01-02
| | | | Backpatch-through: certain files through 9.4
* Adjust comment atop ExecShutdownNode.Amit Kapila2018-08-13
| | | | | | | | | After commits a315b967cc and b805b63ac2, part of the comment atop ExecShutdownNode is redundant. Adjust it. Author: Amit Kapila Backpatch-through: 10 where both the mentioned commits are present. Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info
* Fix buffer usage stats for parallel nodes.Amit Kapila2018-08-03
| | | | | | | | | | | | | | | | | | | | | | | | The buffer usage stats is accounted only for the execution phase of the node. For Gather and Gather Merge nodes, such stats are accumulated at the time of shutdown of workers which is done after execution of node due to which we missed to account them for such nodes. Fix it by treating nodes as running while we shut down them. We can also miss accounting for a Limit node when Gather or Gather Merge is beneath it, because it can finish the execution before shutting down such nodes. So we allow a Limit node to shut down the resources before it completes the execution. In the passing fix the gather node code to allow workers to shut down as soon as we find that all the tuples from the workers have been retrieved. The original code use to do that, but is accidently removed by commit 01edb5c7fc. Reported-by: Adrien Nayrat Author: Amit Kapila and Robert Haas Reviewed-by: Robert Haas and Andres Freund Backpatch-through: 9.6 where this code was introduced Discussion: https://postgr.es/m/86137f17-1dfb-42f9-7421-82fd786b04a1@anayrat.info
* Post-feature-freeze pgindent run.Tom Lane2018-04-26
| | | | Discussion: https://postgr.es/m/15719.1523984266@sss.pgh.pa.us
* Fix a boatload of typos in C comments.Tom Lane2018-04-01
| | | | | | Justin Pryzby Discussion: https://postgr.es/m/20180331105640.GK28454@telsasoft.com
* Update copyright for 2018Bruce Momjian2018-01-02
| | | | Backpatch-through: certain files through 9.3
* Add parallel-aware hash joins.Andres Freund2017-12-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce parallel-aware hash joins that appear in EXPLAIN plans as Parallel Hash Join with Parallel Hash. While hash joins could already appear in parallel queries, they were previously always parallel-oblivious and had a partial subplan only on the outer side, meaning that the work of the inner subplan was duplicated in every worker. After this commit, the planner will consider using a partial subplan on the inner side too, using the Parallel Hash node to divide the work over the available CPU cores and combine its results in shared memory. If the join needs to be split into multiple batches in order to respect work_mem, then workers process different batches as much as possible and then work together on the remaining batches. The advantages of a parallel-aware hash join over a parallel-oblivious hash join used in a parallel query are that it: * avoids wasting memory on duplicated hash tables * avoids wasting disk space on duplicated batch files * divides the work of building the hash table over the CPUs One disadvantage is that there is some communication between the participating CPUs which might outweigh the benefits of parallelism in the case of small hash tables. This is avoided by the planner's existing reluctance to supply partial plans for small scans, but it may be necessary to estimate synchronization costs in future if that situation changes. Another is that outer batch 0 must be written to disk if multiple batches are required. A potential future advantage of parallel-aware hash joins is that right and full outer joins could be supported, since there is a single set of matched bits for each hashtable, but that is not yet implemented. A new GUC enable_parallel_hash is defined to control the feature, defaulting to on. Author: Thomas Munro Reviewed-By: Andres Freund, Robert Haas Tested-By: Rafia Sabih, Prabhat Sahu Discussion: https://postgr.es/m/CAEepm=2W=cOkiZxcg6qiFQP-dHUe09aqTrEMM7yJDrHMhDv_RA@mail.gmail.com https://postgr.es/m/CAEepm=37HKyJ4U6XOLi=JgfSHM3o6B-GaeO-6hkOmneTDkH+Uw@mail.gmail.com
* Allow executor nodes to change their ExecProcNode function.Andres Freund2017-12-13
| | | | | | | | | | | | | | In order for executor nodes to be able to change their ExecProcNode function after ExecInitNode() has finished, provide ExecSetExecProcNode(). This allows any wrappers functions that only execProcnode.c knows about to be reinstalled. The motivation for wanting to change ExecProcNode after ExecInitNode() has finished is that it is not known until later whether parallel query is available, so if a parallel variant is to be installed then ExecInitNode() is too soon to decide. Author: Thomas Munro Reviewed-By: Andres Freund Discussion: https://postgr.es/m/CAEepm=09rr65VN+cAV5FgyM_z=D77Xy8Fuc9CDDDYbq3pQUezg@mail.gmail.com
* Fix EXPLAIN ANALYZE of hash join when the leader doesn't participate.Andres Freund2017-12-05
| | | | | | | | | | | | | | | | | | | | If a hash join appears in a parallel query, there may be no hash table available for explain.c to inspect even though a hash table may have been built in other processes. This could happen either because parallel_leader_participation was set to off or because the leader happened to hit the end of the outer relation immediately (even though the complete relation is not empty) and decided not to build the hash table. Commit bf11e7ee introduced a way for workers to exchange instrumentation via the DSM segment for Sort nodes even though they are not parallel-aware. This commit does the same for Hash nodes, so that explain.c has a way to find instrumentation data from an arbitrary participant that actually built the hash table. Author: Thomas Munro Reviewed-By: Andres Freund Discussion: https://postgr.es/m/CAEepm%3D3DUQC2-z252N55eOcZBer6DPdM%3DFzrxH9dZc5vYLsjaA%40mail.gmail.com
* Push tuple limits through Gather and Gather Merge.Robert Haas2017-08-29
| | | | | | | | | | | | If we only need, say, 10 tuples in total, then we certainly don't need more than 10 tuples from any single process. Pushing down the limit lets workers exit early when possible. For Gather Merge, there is an additional benefit: a Sort immediately below the Gather Merge can be done as a bounded sort if there is an applicable limit. Robert Haas and Tom Lane Discussion: http://postgr.es/m/CA+TgmoYa3QKKrLj5rX7UvGqhH73G1Li4B-EKxrmASaca2tFu9Q@mail.gmail.com
* Final pgindent + perltidy run for v10.Tom Lane2017-08-14
|
* Move ExecProcNode from dispatch to function pointer based model.Andres Freund2017-07-30
| | | | | | | | | | | | | | | | | | | | | | This allows us to add stack-depth checks the first time an executor node is called, and skip that overhead on following calls. Additionally it yields a nice speedup. While it'd probably have been a good idea to have that check all along, it has become more important after the new expression evaluation framework in b8d7f053c5c2bf2a7e - there's no stack depth check in common paths anymore now. We previously relied on ExecEvalExpr() being executed somewhere. We should move towards that model for further routines, but as this is required for v10, it seems better to only do the necessary (which already is quite large). Author: Andres Freund, Tom Lane Reported-By: Julien Rouhaud Discussion: https://postgr.es/m/22833.1490390175@sss.pgh.pa.us https://postgr.es/m/b0af9eaa-130c-60d0-9e4e-7a135b1e0c76@dalibo.com
* Move interrupt checking from ExecProcNode() to executor nodes.Andres Freund2017-07-30
| | | | | | | | | | | | | | | | | In a followup commit ExecProcNode(), and especially the large switch it contains, will largely be replaced by a function pointer directly to the correct node. The node functions will then get invoked by a thin inline function wrapper. To avoid having to include miscadmin.h in headers - CHECK_FOR_INTERRUPTS() - move the interrupt checks into the individual executor routines. While looking through all executor nodes, I noticed a number of arguably missing interrupt checks, add these too. Author: Andres Freund, Tom Lane Reviewed-By: Tom Lane Discussion: https://postgr.es/m/22833.1490390175@sss.pgh.pa.us
* Phase 3 of pgindent updates.Tom Lane2017-06-21
| | | | | | | | | | | | | | | | | | | | | | | | | Don't move parenthesized lines to the left, even if that means they flow past the right margin. By default, BSD indent lines up statement continuation lines that are within parentheses so that they start just to the right of the preceding left parenthesis. However, traditionally, if that resulted in the continuation line extending to the right of the desired right margin, then indent would push it left just far enough to not overrun the margin, if it could do so without making the continuation line start to the left of the current statement indent. That makes for a weird mix of indentations unless one has been completely rigid about never violating the 80-column limit. This behavior has been pretty universally panned by Postgres developers. Hence, disable it with indent's new -lpl switch, so that parenthesized lines are always lined up with the preceding left paren. This patch is much less interesting than the first round of indent changes, but also bulkier, so I thought it best to separate the effects. Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
* Post-PG 10 beta1 pgindent runBruce Momjian2017-05-17
| | | | perltidy run not included.
* Add infrastructure to support EphemeralNamedRelation references.Kevin Grittner2017-03-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A QueryEnvironment concept is added, which allows new types of objects to be passed into queries from parsing on through execution. At this point, the only thing implemented is a collection of EphemeralNamedRelation objects -- relations which can be referenced by name in queries, but do not exist in the catalogs. The only type of ENR implemented is NamedTuplestore, but provision is made to add more types fairly easily. An ENR can carry its own TupleDesc or reference a relation in the catalogs by relid. Although these features can be used without SPI, convenience functions are added to SPI so that ENRs can easily be used by code run through SPI. The initial use of all this is going to be transition tables in AFTER triggers, but that will be added to each PL as a separate commit. An incidental effect of this patch is to produce a more informative error message if an attempt is made to modify the contents of a CTE from a referencing DML statement. No tests previously covered that possibility, so one is added. Kevin Grittner and Thomas Munro Reviewed by Heikki Linnakangas, David Fetter, and Thomas Munro with valuable comments and suggestions from many others
* Add a Gather Merge executor node.Robert Haas2017-03-09
| | | | | | | | | | | | | | | | | | | | Like Gather, we spawn multiple workers and run the same plan in each one; however, Gather Merge is used when each worker produces the same output ordering and we want to preserve that output ordering while merging together the streams of tuples from various workers. (In a way, Gather Merge is like a hybrid of Gather and MergeAppend.) This works out to a win if it saves us from having to perform an expensive Sort. In cases where only a small amount of data would need to be sorted, it may actually be faster to use a regular Gather node and then sort the results afterward, because Gather Merge sometimes needs to wait synchronously for tuples whereas a pure Gather generally doesn't. But if this avoids an expensive sort then it's a win. Rushabh Lathia, reviewed and tested by Amit Kapila, Thomas Munro, and Neha Sharma, and reviewed and revised by me. Discussion: http://postgr.es/m/CAGPqQf09oPX-cQRpBKS0Gq49Z+m6KBxgxd_p9gX8CKk_d75HoQ@mail.gmail.com
* Support XMLTABLE query expressionAlvaro Herrera2017-03-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | XMLTABLE is defined by the SQL/XML standard as a feature that allows turning XML-formatted data into relational form, so that it can be used as a <table primary> in the FROM clause of a query. This new construct provides significant simplicity and performance benefit for XML data processing; what in a client-side custom implementation was reported to take 20 minutes can be executed in 400ms using XMLTABLE. (The same functionality was said to take 10 seconds using nested PostgreSQL XPath function calls, and 5 seconds using XMLReader under PL/Python). The implemented syntax deviates slightly from what the standard requires. First, the standard indicates that the PASSING clause is optional and that multiple XML input documents may be given to it; we make it mandatory and accept a single document only. Second, we don't currently support a default namespace to be specified. This implementation relies on a new executor node based on a hardcoded method table. (Because the grammar is fixed, there is no extensibility in the current approach; further constructs can be implemented on top of this such as JSON_TABLE, but they require changes to core code.) Author: Pavel Stehule, Álvaro Herrera Extensively reviewed by: Craig Ringer Discussion: https://postgr.es/m/CAFj8pRAgfzMD-LoSmnMGybD0WsEznLHWap8DO79+-GTRAPR4qA@mail.gmail.com
* Allow custom and foreign scans to have shutdown callbacks.Robert Haas2017-02-26
| | | | | | | | | | | | This is expected to be useful mostly when performing such scans in parallel, because in that case it allows (in combination with commit acf555bc53acb589b5a2827e65d655fa8c9adee0) nodes below a Gather to get control just before the DSM segment goes away. KaiGai Kohei, except that I rewrote the documentation. Reviewed by Claudio Freire. Discussion: http://postgr.es/m/CADyhKSXJK0jUJ8rWv4AmKDhsUh124_rEn39eqgfC5D8fu6xVuw@mail.gmail.com
* Shut down Gather's children before shutting down Gather itself.Robert Haas2017-02-22
| | | | | | | | | | | | | It turns out that the original shutdown order here does not work well. Multiple people attempting to develop further parallel query patches have discovered that they need to do cleanup before the DSM goes away, and you can't do that if the parent node gets cleaned up first. Patch by me, reviewed by KaiGai Kohei and Dilip Kumar. Discussion: http://postgr.es/m/CA+TgmoY6bOc1YnhcAQnMfCBDbsJzROQ3sYxSAL-SYB5tMJcTKg@mail.gmail.com Discussion: http://postgr.es/m/9A28C8860F777E439AA12E8AEA7694F8012AEB82@BPXM15GP.gisp.nec.co.jp Discussion: http://postgr.es/m/CA+TgmoYuPOc=+xrG1v0fCsoLbKAab9F1ddOeaaiLMzKOiBar1Q@mail.gmail.com
* Move targetlist SRF handling from expression evaluation to new executor node.Andres Freund2017-01-18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Evaluation of set returning functions (SRFs_ in the targetlist (like SELECT generate_series(1,5)) so far was done in the expression evaluation (i.e. ExecEvalExpr()) and projection (i.e. ExecProject/ExecTargetList) code. This meant that most executor nodes performing projection, and most expression evaluation functions, had to deal with the possibility that an evaluated expression could return a set of return values. That's bad because it leads to repeated code in a lot of places. It also, and that's my (Andres's) motivation, made it a lot harder to implement a more efficient way of doing expression evaluation. To fix this, introduce a new executor node (ProjectSet) that can evaluate targetlists containing one or more SRFs. To avoid the complexity of the old way of handling nested expressions returning sets (e.g. having to pass up ExprDoneCond, and dealing with arguments to functions returning sets etc.), those SRFs can only be at the top level of the node's targetlist. The planner makes sure (via split_pathtarget_at_srfs()) that SRF evaluation is only necessary in ProjectSet nodes and that SRFs are only present at the top level of the node's targetlist. If there are nested SRFs the planner creates multiple stacked ProjectSet nodes. The ProjectSet nodes always get input from an underlying node. We also discussed and prototyped evaluating targetlist SRFs using ROWS FROM(), but that turned out to be more complicated than we'd hoped. While moving SRF evaluation to ProjectSet would allow to retain the old "least common multiple" behavior when multiple SRFs are present in one targetlist (i.e. continue returning rows until all SRFs are at the end of their input at the same time), we decided to instead only return rows till all SRFs are exhausted, returning NULL for already exhausted ones. We deemed the previous behavior to be too confusing, unexpected and actually not particularly useful. As a side effect, the previously prohibited case of multiple set returning arguments to a function, is now allowed. Not because it's particularly desirable, but because it ends up working and there seems to be no argument for adding code to prohibit it. Currently the behavior for COALESCE and CASE containing SRFs has changed, returning multiple rows from the expression, even when the SRF containing "arm" of the expression is not evaluated. That's because the SRFs are evaluated in a separate ProjectSet node. As that's quite confusing, we're likely to instead prohibit SRFs in those places. But that's still being discussed, and the code would reside in places not touched here, so that's a task for later. There's a lot of, now superfluous, code dealing with set return expressions around. But as the changes to get rid of those are verbose largely boring, it seems better for readability to keep the cleanup as a separate commit. Author: Tom Lane and Andres Freund Discussion: https://postgr.es/m/20160822214023.aaxz5l4igypowyri@alap3.anarazel.de
* Update copyright via script for 2017Bruce Momjian2017-01-03
|
* Fix typo in commentMagnus Hagander2016-04-26
| | | | Author: Daniel Gustafsson
* Update copyright for 2016Bruce Momjian2016-01-02
| | | | Backpatch certain files through 9.1
* Fix a couple of bugs in recent parallelism-related commits.Robert Haas2015-10-22
| | | | | | | | | | | Commit 816e336f12ecabdc834d4cc31bcf966b2dd323dc added the wrong error check to async.c; sending restrictions is restricted to the leader, not altogether unsafe. Commit 3bd909b220930f21d6e15833a17947be749e7fde added ExecShutdownNode to traverse the planstate tree and call shutdown functions, but made a Gather node, the only node that actually has such a function, abort the tree traversal, which is wrong.
* Add a Gather executor node.Robert Haas2015-09-30
| | | | | | | | | | | | | | | | | | | | | | | | A Gather executor node runs any number of copies of a plan in an equal number of workers and merges all of the results into a single tuple stream. It can also run the plan itself, if the workers are unavailable or haven't started up yet. It is intended to work with the Partial Seq Scan node which will be added in future commits. It could also be used to implement parallel query of a different sort by itself, without help from Partial Seq Scan, if the single_copy mode is used. In that mode, a worker executes the plan, and the parallel leader does not, merely collecting the worker's results. So, a Gather node could be inserted into a plan to split the execution of that plan across two processes. Nested Gather nodes aren't currently supported, but we might want to add support for that in the future. There's nothing in the planner to actually generate Gather nodes yet, so it's not quite time to break out the champagne. But we're getting close. Amit Kapila. Some designs suggestions were provided by me, and I also reviewed the patch. Single-copy mode, documentation, and other minor changes also by me.
* TABLESAMPLE, SQL Standard and extensibleSimon Riggs2015-05-15
| | | | | | | | | | | | | | Add a TABLESAMPLE clause to SELECT statements that allows user to specify random BERNOULLI sampling or block level SYSTEM sampling. Implementation allows for extensible sampling functions to be written, using a standard API. Basic version follows SQLStandard exactly. Usable concrete use cases for the sampling API follow in later commits. Petr Jelinek Reviewed by Michael Paquier and Simon Riggs
* Update copyright for 2015Bruce Momjian2015-01-06
| | | | Backpatch certain files through 9.0
* Introduce custom path and scan providers.Robert Haas2014-11-07
| | | | | | | | | | | This allows extension modules to define their own methods for scanning a relation, and get the core code to use them. It's unclear as yet how much use this capability will find, but we won't find out if we never commit it. KaiGai Kohei, reviewed at various times and in various levels of detail by Shigeru Hanada, Tom Lane, Andres Freund, Álvaro Herrera, and myself.
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Update copyright for 2014Bruce Momjian2014-01-07
| | | | | Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.