path: root/src/include
Commit message (Author, Age)
* Reduce "X = X" to "X IS NOT NULL", if it's easy to do so. (Tom Lane, 2017-10-08)

  If the operator is a strict btree equality operator, and X isn't volatile, then
  the clause must yield true for any non-null value of X, or null if X is null.
  At top level of a WHERE clause, we can ignore the distinction between false and
  null results, so it's valid to simplify the clause to "X IS NOT NULL". This is
  a useful improvement mainly because we'll get a far better selectivity estimate
  in most cases.

  Because such cases seldom arise in well-written queries, it is unappetizing to
  expend a lot of planner cycles looking for them ... but it turns out that
  there's a place we can shoehorn this in practically for free, because
  equivclass.c already has to detect and reject candidate equivalences of the
  form X = X. That doesn't catch every place that it would be valid to simplify
  to X IS NOT NULL, but it catches the typical case. Working harder doesn't seem
  justified.

  Patch by me, reviewed by Petr Jelinek

  Discussion: https://postgr.es/m/CAMjNa7cC4X9YR-vAJS-jSYCajhRDvJQnN7m2sLH1wLh-_Z2bsw@mail.gmail.com

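For illustration, the simplification amounts to roughly the following, expressed with PostgreSQL's expression-node APIs. This is a sketch: the helper name and its placement are made up, and the real transformation happens while equivclass.c rejects X = X candidate equivalences (after it has already established that the operator is a btree equality operator).

    #include "postgres.h"
    #include "nodes/primnodes.h"
    #include "optimizer/clauses.h"      /* contain_volatile_functions() */
    #include "utils/lsyscache.h"        /* op_strict() */

    /*
     * Hypothetical helper: given a clause already known to use a btree
     * equality operator, return an equivalent "X IS NOT NULL" NullTest if
     * the clause has the form X = X with non-volatile X; else return NULL.
     */
    static Node *
    simplify_self_equality(OpExpr *clause)
    {
        Node       *leftarg;
        Node       *rightarg;
        NullTest   *ntest;

        if (list_length(clause->args) != 2)
            return NULL;
        leftarg = (Node *) linitial(clause->args);
        rightarg = (Node *) lsecond(clause->args);

        /* must be X = X, with a strict operator and non-volatile X */
        if (!equal(leftarg, rightarg))
            return NULL;
        if (!op_strict(clause->opno))
            return NULL;
        if (contain_volatile_functions(leftarg))
            return NULL;

        ntest = makeNode(NullTest);
        ntest->arg = (Expr *) leftarg;
        ntest->nulltesttype = IS_NOT_NULL;
        ntest->argisrow = false;
        ntest->location = clause->location;
        return (Node *) ntest;
    }
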
* Fix crash when logical decoding is invoked from a PL function. (Tom Lane, 2017-10-06)

  The logical decoding functions do BeginInternalSubTransaction and
  RollbackAndReleaseCurrentSubTransaction to clean up after themselves. It turns
  out that AtEOSubXact_SPI has an unrecognized assumption that we always need to
  cancel the active SPI operation in the SPI context that surrounds the
  subtransaction (if there is one). That's true when the
  RollbackAndReleaseCurrentSubTransaction call is coming from the SPI-using
  function itself, but not when it's happening inside some unrelated function
  invoked by a SPI query. In practice the affected callers are the various PLs.

  To fix, record the current subtransaction ID when we begin a SPI operation,
  and clean up only if that ID is the subtransaction being canceled.

  Also, remove AtEOSubXact_SPI's assertion that it must have cleaned up the
  surrounding SPI context's active tuptable. That's proven wrong by the same
  test case.

  Also clarify (or, if you prefer, reinterpret) the calling conventions for
  _SPI_begin_call and _SPI_end_call. The memory context cleanup in the latter
  means that these have always had the flavor of a matched resource-management
  pair, but they weren't documented that way before.

  Per report from Ben Chobot. Back-patch to 9.4 where logical decoding came in.
  In principle, the SPI changes should go all the way back, since the problem
  dates back to commit 7ec1c5a86. But given the lack of field complaints it
  seems few people are using internal subtransactions in this way. So I don't
  feel a need to take any risks in 9.2/9.3.

  Discussion: https://postgr.es/m/73FBA179-C68C-4540-9473-71E865408B15@silentmedia.com

* Copy information from the relcache instead of pointing to it. (Robert Haas, 2017-10-06)

  We have the relations continuously locked, but not open, so relcache pointers
  are not guaranteed to be stable. Per buildfarm member prion.

  Ashutosh Bapat. I fixed a typo.

  Discussion: http://postgr.es/m/CAFjFpRcRBqoKLZSNmRsjKr81uEP=ennvqSQaXVCCBTXvJ2rW+Q@mail.gmail.com

* Fix traversal of half-frozen update chains (Alvaro Herrera, 2017-10-06)

  When some tuple versions in an update chain are frozen due to them being older
  than freeze_min_age, the xmax/xmin trail can become broken. This breaks HOT
  (and probably other things). A subsequent VACUUM can break things in more
  serious ways, such as leaving orphan heap-only tuples whose root HOT redirect
  items were removed. This can be seen because index creation (or REINDEX)
  complains with errors like

      ERROR: XX000: failed to find parent tuple for heap-only tuple at (0,7) in table "t"

  Because of relfrozenxid constraints, we cannot avoid the freezing of the early
  tuples, so we must cope with the results: whenever we see an Xmin of
  FrozenTransactionId, consider it a match for whatever the previous Xmax value
  was.

  This problem seems to have appeared in 9.3 with multixact changes, though
  strictly speaking it seems unrelated.

  Since 9.4 we have commit 37484ad2a "Change the way we mark tuples as frozen",
  so the fix is simple: just compare the raw Xmin (still stored in the tuple
  header, since freezing merely set an infomask bit) to the Xmax. But in 9.3 we
  rewrite the Xmin value to FrozenTransactionId, so the original value is lost
  and we have nothing to compare the Xmax with. To cope with that case we need
  to compare the Xmin with FrozenXid, assume it's a match, and hope for the
  best. Sadly, since you can pg_upgrade a 9.3 instance containing half-frozen
  pages to newer releases, we need to keep the old check in newer versions too,
  which seems a bit brittle; I hope we can somehow get rid of that.

  I didn't optimize the new function for performance. The new coding is probably
  a bit slower than before, since there is a function call rather than a
  straight comparison, but I'd rather have it work correctly than be fast but
  wrong.

  This is a followup after 20b655224249 fixed a few related problems.
  Apparently, in 9.6 and up there are more ways to get into trouble, but in
  9.3 - 9.5 I cannot reproduce a problem anymore with this patch, so there must
  be a separate bug.

  Reported-by: Peter Geoghegan
  Diagnosed-by: Peter Geoghegan, Michael Paquier, Daniel Wood, Yi Wen Wong, Álvaro
  Discussion: https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com

* Basic partition-wise join functionality. (Robert Haas, 2017-10-06)

  Instead of joining two partitioned tables in their entirety we can, if it is
  an equi-join on the partition keys, join the matching partitions individually.
  This involves teaching the planner about "other join" rels, which are related
  to regular join rels in the same way that other member rels are related to
  baserels.

  This can use significantly more CPU time and memory than regular join
  planning, because there may now be a set of "other" rels not only for every
  base relation but also for every join relation. In most practical cases, this
  probably shouldn't be a problem, because (1) it's probably unusual to join
  many tables each with many partitions using the partition keys for all joins
  and (2) if you do that scenario then you probably have a big enough machine to
  handle the increased memory cost of planning and (3) the resulting plan is
  highly likely to be better, so what you spend in planning you'll make up on
  the execution side. All the same, for now, turn this feature off by default.

  Currently, we can only perform joins between two tables whose partitioning
  schemes are absolutely identical. It would be nice to cope with other
  scenarios, such as extra partitions on one side or the other with no match on
  the other side, but that will have to wait for a future patch.

  Ashutosh Bapat, reviewed and tested by Rajkumar Raghuwanshi, Amit Langote,
  Rafia Sabih, Thomas Munro, Dilip Kumar, Antonin Houska, Amit Khandekar, and by
  me. A few final adjustments by me.

  Discussion: http://postgr.es/m/CAFjFpRfQ8GrQvzp3jA2wnLqrHmaXna-urjm_UY9BqXj=EaDTSA@mail.gmail.com
  Discussion: http://postgr.es/m/CAFjFpRcitjfrULr5jfuKWRPsGUX0LQ0k8-yG0Qw2+1LBGNpMdw@mail.gmail.com

* Msvc doesn't know UINT16_MAX, replace with PG_UINT16_MAX. (Andres Freund, 2017-10-04)

  The UINT16_MAX usage originates from commit 212e6f34d55c.

  Per buildfarm animal currawong.

* Replace binary search in fmgr_isbuiltin with a lookup array. (Andres Freund, 2017-10-04)

  Turns out we have enough functions that the binary search is quite noticeable
  in profiles.

  Thus have Gen_fmgrtab.pl build a new mapping from a builtin function's oid to
  an index in the existing fmgr_builtins array. That keeps the additional memory
  usage at a reasonable amount.

  Author: Andres Freund, with input from Tom Lane
  Discussion: https://postgr.es/m/20170914065128.a5sk7z4xde5uy3ei@alap3.anarazel.de

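A rough, self-contained sketch of the dense-lookup idea follows. The table contents, array names, and bounds here are made up for illustration; in PostgreSQL the equivalent arrays are emitted at build time by Gen_fmgrtab.pl.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    typedef uint32_t Oid;

    typedef struct
    {
        Oid         foid;       /* function OID */
        const char *funcName;   /* C name of the function */
    } FmgrBuiltin;

    /* Tiny made-up stand-in for fmgr_builtins[] (the real one has thousands). */
    static const FmgrBuiltin fmgr_builtins[] = {
        {100, "example_func_a"},
        {105, "example_func_b"},
        {2118, "example_func_c"},
    };

    /* Dense OID-to-index map; a uint16 sentinel marks "not a builtin". */
    #define INVALID_BUILTIN_INDEX  UINT16_MAX
    #define BUILTIN_OID_LIMIT      2119      /* one past the largest OID above */
    static uint16_t fmgr_builtin_oid_index[BUILTIN_OID_LIMIT];

    static const FmgrBuiltin *
    fmgr_isbuiltin(Oid id)
    {
        uint16_t    index;

        if (id >= BUILTIN_OID_LIMIT)
            return NULL;
        index = fmgr_builtin_oid_index[id];
        if (index == INVALID_BUILTIN_INDEX)
            return NULL;
        return &fmgr_builtins[index];       /* O(1) instead of a binary search */
    }

    int
    main(void)
    {
        /* Build the index once; the real array is generated, not computed. */
        for (size_t i = 0; i < BUILTIN_OID_LIMIT; i++)
            fmgr_builtin_oid_index[i] = INVALID_BUILTIN_INDEX;
        for (size_t i = 0; i < sizeof(fmgr_builtins) / sizeof(fmgr_builtins[0]); i++)
            fmgr_builtin_oid_index[fmgr_builtins[i].foid] = (uint16_t) i;

        const FmgrBuiltin *fb = fmgr_isbuiltin(2118);
        printf("%s\n", fb ? fb->funcName : "not a builtin");
        return 0;
    }
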
* Allow multiple tables to be specified in one VACUUM or ANALYZE command. (Tom Lane, 2017-10-03)

  Not much to say about this; does what it says on the tin.

  However, formerly, if there was a column list then the ANALYZE action was
  implied; now it must be specified, or you get an error. This is because it
  would otherwise be a bit unclear what the user meant if some tables have
  column lists and some don't.

  Nathan Bossart, reviewed by Michael Paquier and Masahiko Sawada, with some
  editorialization by me

  Discussion: https://postgr.es/m/E061A8E3-5E3D-494D-94F0-E8A9B312BBFC@amazon.com

* Fix race condition with unprotected use of a latch pointer variable. (Tom Lane, 2017-10-03)

  Commit 597a87ccc introduced a latch pointer variable to replace use of a
  long-lived shared latch in the shared WalRcvData structure. This was not well
  thought out, because there are now hazards of the pointer variable changing
  while it's being inspected by another process. This could obviously lead to a
  core dump in code like

      if (WalRcv->latch)
          SetLatch(WalRcv->latch);

  and there's a more remote risk of a torn read, if we have any platforms where
  reading/writing a pointer is not atomic. An actual problem would occur only if
  the walreceiver process exits (gracefully) while the startup process is trying
  to signal it, but that seems well within the realm of possibility.

  To fix, treat the pointer variable (not the referenced latch) as being
  protected by the WalRcv->mutex spinlock. There remains a race condition that
  we could apply SetLatch to a process latch that no longer belongs to the
  walreceiver, but I believe that's harmless: at worst it'd cause an extra
  wakeup of the next process to use that PGPROC structure.

  Back-patch to v10 where the faulty code was added.

  Discussion: https://postgr.es/m/22735.1507048202@sss.pgh.pa.us

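The fix boils down to copying the pointer out while holding the spinlock and then using the local copy. A simplified sketch, with a made-up helper name rather than the actual walreceiver code:

    #include "postgres.h"
    #include "replication/walreceiver.h"    /* WalRcv, WalRcvData */
    #include "storage/latch.h"
    #include "storage/spin.h"

    /* Hypothetical helper: wake the walreceiver without racing against its exit. */
    static void
    wakeup_walreceiver(void)
    {
        Latch      *latch;

        /* WalRcv->latch may change concurrently; read it under the spinlock. */
        SpinLockAcquire(&WalRcv->mutex);
        latch = WalRcv->latch;
        SpinLockRelease(&WalRcv->mutex);

        if (latch)
            SetLatch(latch);
    }
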
* Remove redundant stdint.h include. (Andres Freund, 2017-10-01)

  Discussion: https://postgr.es/m/31674.1506788226@sss.pgh.pa.us

* Support arrays over domains. (Tom Lane, 2017-09-30)

  Allowing arrays with a domain type as their element type was left un-done in
  the original domain patch, but not for any very good reason. This omission
  leads to such surprising results as array_agg() not working on a domain
  column, because the parser can't identify a suitable output type for the
  polymorphic aggregate.

  In order to fix this, first clean up the APIs of coerce_to_domain() and some
  internal functions in parse_coerce.c so that we consistently pass around a
  CoercionContext along with CoercionForm. Previously, we sometimes passed an
  "isExplicit" boolean flag instead, which is strictly less information; and
  coerce_to_domain() didn't even get that, but instead had to reverse-engineer
  isExplicit from CoercionForm. That's contrary to the documentation in
  primnodes.h that says that CoercionForm only affects display and not
  semantics. I don't think this change fixes any live bugs, but it makes things
  more consistent. The main reason for doing it though is that now
  build_coercion_expression() receives ccontext, which it needs in order to be
  able to recursively invoke coerce_to_target_type().

  Next, reimplement ArrayCoerceExpr so that the node does not directly know any
  details of what has to be done to the individual array elements while
  performing the array coercion. Instead, the per-element processing is
  represented by a sub-expression whose input is a source array element and
  whose output is a target array element. This simplifies life in
  parse_coerce.c, because it can build that sub-expression by a recursive
  invocation of coerce_to_target_type(). The executor now handles the
  per-element processing as a compiled expression instead of hard-wired code.
  The main advantage of this is that we can use a single ArrayCoerceExpr to
  handle as many as three successive steps per element: base type conversion,
  typmod coercion, and domain constraint checking. The old code used two
  stacked ArrayCoerceExprs to handle type + typmod coercion, which was pretty
  inefficient, and adding yet another array deconstruction to do domain
  constraint checking seemed very unappetizing.

  In the case where we just need a single, very simple coercion function, doing
  this straightforwardly leads to a noticeable increase in the per-array-element
  runtime cost. Hence, add an additional shortcut evalfunc in execExprInterp.c
  that skips unnecessary overhead for that specific form of expression. The
  runtime speed of simple cases is within 1% or so of where it was before, while
  cases that previously required two levels of array processing are
  significantly faster.

  Finally, create an implicit array type for every domain type, as we do for
  base types, enums, etc. Everything except the array-coercion case seems to
  just work without further effort.

  Tom Lane, reviewed by Andrew Dunstan

  Discussion: https://postgr.es/m/9852.1499791473@sss.pgh.pa.us

* Fix copy & pasto in 510b8cbff15f. (Andres Freund, 2017-09-29)

  Reported-By: Peter Geoghegan

* Fix typo. (Andres Freund, 2017-09-29)

  Reported-By: Thomas Munro and Jesper Pedersen

* Extend & revamp pg_bswap.h infrastructure. (Andres Freund, 2017-09-29)

  Upcoming patches are going to address performance issues that involve slow
  system provided ntohs/htons etc. To address that expand pg_bswap.h to provide
  pg_ntoh{16,32,64}, pg_hton{16,32,64} and optimize their respective
  implementations by using compiler intrinsics for gcc compatible compilers and
  msvc. Fall back to manual implementations using shifts etc otherwise.

  Additionally remove multiple evaluation hazards from the existing BSWAP32/64
  macros, by replacing them with inline functions when necessary. In the course
  of that the naming scheme is changed to pg_bswap16/32/64.

  Author: Andres Freund
  Discussion: https://postgr.es/m/20170927172019.gheidqy6xvlxb325@alap3.anarazel.de

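The general shape of the new helpers is shown below for the 32-bit case. This is a sketch of the intrinsic-plus-fallback pattern the message describes, not a copy of src/include/port/pg_bswap.h; WORDS_BIGENDIAN stands in for the build system's endianness macro.

    #include <stdint.h>

    /* Byte-swap a 32-bit value, preferring a compiler intrinsic when available. */
    static inline uint32_t
    pg_bswap32(uint32_t x)
    {
    #if defined(__GNUC__) || defined(__clang__)
        return __builtin_bswap32(x);
    #else
        return ((x << 24) & 0xff000000U) |
               ((x <<  8) & 0x00ff0000U) |
               ((x >>  8) & 0x0000ff00U) |
               ((x >> 24) & 0x000000ffU);
    #endif
    }

    /* On a little-endian host, network-order conversion is just a byte swap. */
    #ifndef WORDS_BIGENDIAN
    #define pg_hton32(x)  pg_bswap32(x)
    #define pg_ntoh32(x)  pg_bswap32(x)
    #else
    #define pg_hton32(x)  (x)
    #define pg_ntoh32(x)  (x)
    #endif
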
* Add background worker type (Peter Eisentraut, 2017-09-29)

  Add bgw_type field to background worker structure. It is intended to be set to
  the same value for all workers of the same type, so they can be grouped in
  pg_stat_activity, for example.

  The backend_type column in pg_stat_activity now shows bgw_type for a
  background worker. The ps listing also no longer calls out that a process is a
  background worker but just shows the bgw_type. That way, being a background
  worker is more of an implementation detail now that is not shown to the user.
  However, most log messages still refer to 'background worker "%s"'; otherwise
  constructing sensible and translatable log messages would become tricky.

  Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
  Reviewed-by: Daniel Gustafsson <daniel@yesql.se>

* Remove replacement selection sort. (Robert Haas, 2017-09-29)

  At the time replacement_sort_tuples was introduced, there were still cases
  where replacement selection sort noticeably outperformed using quicksort even
  for the first run. However, those cases seem to have evaporated as a result of
  further improvements made since that time (and perhaps also advances in CPU
  technology). So remove replacement selection and the controlling GUC entirely.
  This makes tuplesort.c noticeably simpler and probably paves the way for
  further optimizations someone might want to do later.

  Peter Geoghegan, with review and testing by Tomas Vondra and me.

  Discussion: https://postgr.es/m/CAH2-WzmmNjG_K0R9nqYwMq3zjyJJK+hCbiZYNGhAy-Zyjs64GQ@mail.gmail.com

* Revert to 9.6 treatment of ALTER TYPE enumtype ADD VALUE. (Tom Lane, 2017-09-27)

  This reverts commit 15bc038f9, along with the followon commits 1635e80d3 and
  984c92074 that tried to clean up the problems exposed by bug #14825. The
  result was incomplete because it failed to address parallel-query
  requirements. With the 10.0 release so close upon us, now does not seem like
  the time to be adding more code to fix that.

  I hope we can un-revert this code and add the missing parallel query support
  during the v11 cycle.

  Back-patch to v10.

  Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org

* Use a blacklist to distinguish original from add-on enum values. (Tom Lane, 2017-09-26)

  Commit 15bc038f9 allowed ALTER TYPE ADD VALUE to be executed inside
  transaction blocks, by disallowing the use of the added value later in the
  same transaction, except under limited circumstances. However, the test for
  "limited circumstances" was heuristic and could reject references to enum
  values that were created during CREATE TYPE AS ENUM, not just later. This
  breaks the use-case of restoring pg_dump scripts in a single transaction, as
  reported in bug #14825 from Balazs Szilfai.

  We can improve this by keeping a "blacklist" table of enum value OIDs created
  by ALTER TYPE ADD VALUE during the current transaction. Any
  visible-but-uncommitted value whose OID is not in the blacklist must have been
  created by CREATE TYPE AS ENUM, and can be used safely because it could not
  have a lifespan shorter than its parent enum type.

  This change also removes the restriction that a renamed enum value can't be
  used before being committed (unless it was on the blacklist).

  Andrew Dunstan, with cosmetic improvements by me. Back-patch to v10.

  Discussion: https://postgr.es/m/20170922185904.1448.16585@wrigleys.postgresql.org

* Remove lsn from HashScanPosData. (Robert Haas, 2017-09-26)

  This was intended as infrastructure for weakening VACUUM's locking
  requirements, similar to what was done for btree indexes in commit
  2ed5b87f96d473962ec5230fd820abfeaccb2069. However, for hash indexes, it seems
  that the improvements which are possible are actually extremely marginal.
  Furthermore, performing the LSN cross-check will end up skipping cleanup far
  more often than is necessary; we only care about page modifications due to a
  VACUUM, but the LSN check will fail if ANY modification has occurred. So,
  rather than pressing forward with that "optimization", just rip the LSN field
  out.

  Patch by me, reviewed by Ashutosh Sharma and Amit Kapila

  Discussion: http://postgr.es/m/CAA4eK1JxqqcuC5Un7YLQVhOYSZBS+t=3xqZuEkt5RyquyuxpwQ@mail.gmail.com

* Avoid SIGBUS on Linux when a DSM memory request overruns tmpfs. (Tom Lane, 2017-09-25)

  On Linux, shared memory segments created with shm_open() are backed by swap
  files created in tmpfs. If the swap file needs to be extended, but there's no
  tmpfs space left, you get a very unfriendly SIGBUS trap. To avoid this, force
  allocation of the full request size when we create the segment. This adds a
  few cycles, but none that we wouldn't expend later anyway, assuming the
  request isn't hugely bigger than the actual need.

  Make this code #ifdef __linux__, because (a) there's not currently a reason to
  think the same problem exists on other platforms, and (b) applying
  posix_fallocate() to an FD created by shm_open() isn't very portable anyway.

  Back-patch to 9.4 where the DSM code came in.

  Thomas Munro, per a bug report from Amul Sul

  Discussion: https://postgr.es/m/1002664500.12301802.1471008223422.JavaMail.yahoo@mail.yahoo.com

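In outline, the Linux-specific path looks something like the following simplified sketch (function name and error handling are illustrative, not the actual dsm_impl.c code):

    #include <errno.h>
    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * Sketch: create a POSIX shm segment and force full allocation of its
     * backing tmpfs space, so a later page fault cannot hit ENOSPC and be
     * delivered as SIGBUS.
     */
    static int
    create_dsm_segment(const char *name, size_t request_size)
    {
        int     fd;

        fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
        if (fd < 0)
            return -1;

        if (ftruncate(fd, (off_t) request_size) != 0)
        {
            close(fd);
            shm_unlink(name);
            return -1;
        }

    #ifdef __linux__
        {
            int     rc;

            /* Reserve all the space now; retry if interrupted by a signal. */
            do
            {
                rc = posix_fallocate(fd, 0, (off_t) request_size);
            } while (rc == EINTR);

            if (rc != 0)
            {
                close(fd);
                shm_unlink(name);
                errno = rc;     /* posix_fallocate returns the error code */
                return -1;
            }
        }
    #endif

        return fd;
    }
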
* Refactor new file permission handling (Peter Eisentraut, 2017-09-23)

  The file handling functions from fd.c were called with a diverse mix of
  notations for the file permissions when they were opening new files. Almost
  all files created by the server should have the same permissions set. So
  change the API so that e.g. OpenTransientFile() automatically uses the
  standard permissions set, and OpenTransientFilePerm() is a new function that
  takes an explicit permissions set for the few cases where it is needed. This
  also saves an unnecessary argument for call sites that are just opening an
  existing file.

  While we're reviewing these APIs, get rid of the FileName typedef and use the
  standard const char * for the file name and mode_t for the file mode. This
  makes these functions match other file handling functions and removes an
  unnecessary layer of mysteriousness. We can also get rid of a few casts that
  way.

  Author: David Steele <david@pgmasters.net>

* Add inline murmurhash32(uint32) function. (Andres Freund, 2017-09-22)

  The function already existed in tidbitmap.c but more users requiring fast
  hashing of 32bit ints are coming up.

  Author: Andres Freund
  Discussion: https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de

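The hash in question is essentially the 32-bit MurmurHash3 finalizer ("fmix32"); a standalone rendering, assumed equivalent to the in-tree inline function, looks like this:

    #include <stdint.h>

    /*
     * 32-bit integer hash based on the MurmurHash3 finalizer: cheap, and it
     * mixes every input bit into every output bit.
     */
    static inline uint32_t
    murmurhash32(uint32_t h)
    {
        h ^= h >> 16;
        h *= 0x85ebca6b;
        h ^= h >> 13;
        h *= 0xc2b2ae35;
        h ^= h >> 16;
        return h;
    }
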
* Fix s/intidb/initdb/ typo. (Andres Freund, 2017-09-22)

  Reported-By: Michael Paquier
  Discussion: https://postgr.es/m/CAB7nPqTfaKAYZ4wuUM-W8kc4VnXrxX1=5-a9i==VoUPTMFpsgg@mail.gmail.com

* For wal_consistency_checking, mask page checksum as well as page LSN. (Robert Haas, 2017-09-22)

  If the LSN is different, the checksum will be different, too.

  Ashwin Agrawal, reviewed by Michael Paquier and Kuntal Ghosh

  Discussion: http://postgr.es/m/CALfoeis5iqrAU-+JAN+ZzXkpPr7+-0OAGv7QUHwFn=-wDy4o4Q@mail.gmail.com

* hash: Implement page-at-a-time scan. (Robert Haas, 2017-09-22)

  Commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272 added a similar optimization
  to btree back in 2006, but nobody bothered to implement the same thing for
  hash indexes, probably because they weren't WAL-logged and had lots of other
  performance problems as well. As with the corresponding btree case, this
  eliminates the problem of potentially needing to refind our position within
  the page, and cuts down on pin/unpin traffic as well.

  Ashutosh Sharma, reviewed by Alexander Korotkov, Jesper Pedersen, Amit Kapila,
  and me. Some final edits to comments and README by me.

  Discussion: http://postgr.es/m/CAE9k0Pm3KTx93K8_5j6VMzG4h5F+SyknxUwXrN-zqSZ9X8ZS3w@mail.gmail.com

* Assume wcstombs(), towlower(), and sibling functions are always present. (Tom Lane, 2017-09-22)

  These functions are required by SUS v2, which is our minimum baseline for Unix
  platforms, and are present on all interesting Windows versions as well. Even
  our oldest buildfarm members have them. Thus, we were not testing the
  "!USE_WIDE_UPPER_LOWER" code paths, which explains why the bug fixed in commit
  e6023ee7f escaped detection. Per discussion, there seems to be no more
  real-world value in maintaining this option.

  Hence, remove the configure-time tests for wcstombs() and towlower(), remove
  the USE_WIDE_UPPER_LOWER symbol, and remove all the !USE_WIDE_UPPER_LOWER
  code. There's not actually all that much of the latter, but simplifying the
  #if nests is a win in itself.

  Discussion: https://postgr.es/m/20170921052928.GA188913@rfd.leadboat.com

* Provide a test for variable existence in psql (Andrew Dunstan, 2017-09-21)

  "\if :{?variable_name}" will be translated to "\if TRUE" if the variable
  exists and "\if FALSE" otherwise. Thus it will be possible to execute code
  conditionally on the existence of the variable, regardless of its value.

  Fabien Coelho, with some review by Robins Tharakan and some light text editing
  by me.

  Discussion: https://postgr.es/m/alpine.DEB.2.20.1708260835520.3627@lancre

* Associate partitioning information with each RelOptInfo. (Robert Haas, 2017-09-20)

  This is not used for anything yet, but it is necessary infrastructure for
  partition-wise join and for partition pruning without constraint exclusion.

  Ashutosh Bapat, reviewed by Amit Langote and with quite a few changes, mostly
  cosmetic, by me. Additional review and testing of this patch series by Antonin
  Houska, Amit Khandekar, Rafia Sabih, Rajkumar Raghuwanshi, Thomas Munro, and
  Dilip Kumar.

  Discussion: http://postgr.es/m/CAFjFpRfneFG3H+F6BaiXemMrKF+FY-POpx3Ocy+RiH3yBmXSNw@mail.gmail.com

* Make WAL segment size configurable at initdb time. (Andres Freund, 2017-09-19)

  For performance reasons a larger segment size than the default 16MB can be
  useful. A larger segment size has two main benefits: firstly, in setups using
  archiving, it makes it easier to write scripts that can keep up with higher
  amounts of WAL; secondly, the WAL has to be written and synced to disk less
  frequently.

  But at the same time large segment sizes are disadvantageous for smaller
  databases. So far the segment size had to be configured at compile time, often
  making it unrealistic to choose one fitting a particular load. Therefore
  change it to an initdb time setting.

  This includes a breaking change to the xlogreader.h API, which now requires
  the current segment size to be configured. For that and similar reasons a
  number of binaries had to be taught how to recognize the current segment size.

  Author: Beena Emerson, editorialized by Andres Freund
  Reviewed-By: Andres Freund, David Steele, Kuntal Ghosh, Michael Paquier,
  Peter Eisentraut, Robert Haas, Tushar Ahuja
  Discussion: https://postgr.es/m/CAOG9ApEAcQ--1ieKbhFzXSQPw_YLmepaa4hNdnY5+ZULpt81Mw@mail.gmail.com

* Remove no-op GiST support functions in the core GiST opclasses. (Tom Lane, 2017-09-19)

  The preceding patch allowed us to remove useless GiST support functions. This
  patch actually does that for all the no-op cases in the core GiST code. This
  buys us whatever performance gain is to be had, and more importantly exercises
  the preceding patch.

  There remain no-op functions in the contrib GiST opclasses, but those will
  take more work to remove.

  Discussion: https://postgr.es/m/CAJEAwVELVx9gYscpE=Be6iJxvdW5unZ_LkcAaVNSeOwvdwtD=A@mail.gmail.com

* Avoid use of non-portable strnlen() in pgstat_clip_activity(). (Andres Freund, 2017-09-19)

  The use of strnlen rather than strlen was just paranoia. Instead of giving up
  on the paranoia, just implement the safeguard differently. And add a comment
  explaining why we're careful.

  Author: Andres Freund
  Discussion: https://postgr.es/m/E1duOkJ-0001Mc-U5@gemulon.postgresql.org

* Speedup pgstat_report_activity by moving mb-aware truncation to read side. (Andres Freund, 2017-09-19)

  Previously multi-byte aware truncation was done on every
  pgstat_report_activity() call - proving to be a bottleneck for workloads with
  long query strings that execute quickly.

  Instead move the truncation to the read side, which commonly is executed far
  less frequently. That's possible because all server encodings allow
  determining the length of a multi-byte string from the first byte.

  Rename PgBackendStatus.st_activity to st_activity_raw so existing extension
  users of the field break - their code has to be adjusted to use
  pgstat_clip_activity().

  Author: Andres Freund
  Tested-By: Kuntal Ghosh
  Reviewed-By: Robert Haas, Tom Lane
  Discussion: https://postgr.es/m/20170912071948.pa7igbpkkkviecpz@alap3.anarazel.de

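For extension code that used to read st_activity directly, the adjustment the message describes looks roughly like the sketch below. This is hedged: the helper name is made up, and the exact pgstat_clip_activity() signature and the surrounding st_changecount read protocol (omitted here) should be checked against pgstat.h for the target version.

    #include "postgres.h"
    #include "pgstat.h"

    /*
     * Sketch: obtain a NUL-terminated, multibyte-safe copy of a backend's
     * current query text.  st_activity_raw may be truncated mid-character,
     * so it must be passed through pgstat_clip_activity() before use.
     */
    static char *
    get_backend_activity_copy(PgBackendStatus *beentry)
    {
        char       *activity;

        /* assumed to return a palloc'd, cleanly truncated copy */
        activity = pgstat_clip_activity(beentry->st_activity_raw);

        return activity;        /* caller should pfree() when done */
    }
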
* Fix crash restart bug introduced in 8356753c212. (Andres Freund, 2017-09-18)

  The bug was caused by not re-reading the control file during crash recovery
  restarts, which led to an attempt to pfree() shared memory contents. The fix
  is to re-read the control file, which seems good anyway.

  It's unclear as of this moment whether we want to keep the refactoring
  introduced in the commit referenced above, or come up with an alternative
  approach. But fixing the bug in the meantime seems like a good idea
  regardless.

  A followup commit will introduce regression test coverage for crash restarts.

  Reported-By: Tom Lane
  Discussion: https://postgr.es/m/14134.1505572349@sss.pgh.pa.us

* Minor code-cleanliness improvements for btree. (Tom Lane, 2017-09-18)

  Make the btree page-flags test macros (P_ISLEAF and friends) return clean
  boolean values, rather than values that might not fit in a bool. Use them in a
  few places that were randomly referencing the flag bits directly.

  In passing, change access/nbtree/'s only direct use of BUFFER_LOCK_SHARE to
  BT_READ. (Some think we should go the other way, but as long as we have
  BT_READ/BT_WRITE, let's use them consistently.)

  Masahiko Sawada, reviewed by Doug Doole

  Discussion: https://postgr.es/m/CAD21AoBmWPeN=WBB5Jvyz_Nt3rmW1ebUyAnk3ZbJP3RMXALJog@mail.gmail.com

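The shape of the macro change is roughly as follows (a sketch; the actual definitions live in src/include/access/nbtree.h, where "opaque" is a BTPageOpaque and the "_OLD" name is invented here purely for contrast):

    /* Before: the test yields the raw flag bit, which need not fit in a bool. */
    #define P_ISLEAF_OLD(opaque)    ((opaque)->btpo_flags & BTP_LEAF)

    /* After: force a clean true/false result. */
    #define P_ISLEAF(opaque)        (((opaque)->btpo_flags & BTP_LEAF) != 0)
    #define P_ISROOT(opaque)        (((opaque)->btpo_flags & BTP_ROOT) != 0)
    #define P_ISDELETED(opaque)     (((opaque)->btpo_flags & BTP_DELETED) != 0)
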
* Make ExplainOpenGroup and ExplainCloseGroup public. (Tom Lane, 2017-09-18)

  Extensions with custom plan nodes might like to use these in their EXPLAIN
  output.

  Hadi Moshayedi

  Discussion: https://postgr.es/m/CA+_kT_dU-rHCN0u6pjA6bN5CZniMfD=-wVqPY4QLrKUY_uJq5w@mail.gmail.com

* Make DatumGetFoo/PG_GETARG_FOO/PG_RETURN_FOO macro names more consistent. (Tom Lane, 2017-09-18)

  By project convention, these names should include "P" when dealing with a
  pointer type; that is, if the result of a GETARG macro is of type FOO *, it
  should be called PG_GETARG_FOO_P not just PG_GETARG_FOO. Some newer types such
  as JSONB and ranges had not followed the convention, and a number of contrib
  modules hadn't gotten that memo either. Rename the offending macros to improve
  consistency.

  In passing, fix a few places that thought PG_DETOAST_DATUM() returns a Datum;
  it does not, it returns "struct varlena *". Applying DatumGetPointer to that
  happens not to cause any bad effects today, but it's formally wrong. Also,
  adjust an ltree macro that was designed without any thought for what pgindent
  would do with it.

  This is all cosmetic and shouldn't have any impact on generated code.

  Mark Dilger, some further tweaks by me

  Discussion: https://postgr.es/m/EA5676F4-766F-4F38-8348-ECC7DB427C6A@gmail.com

* Fix SQL-spec incompatibilities in new transition table feature. (Tom Lane, 2017-09-16)

  The standard says that all changes of the same kind (insert, update, or
  delete) caused in one table by a single SQL statement should be reported in a
  single transition table; and by that, they mean to include foreign key
  enforcement actions cascading from the statement's direct effects. It's also
  reasonable to conclude that if the standard had wCTEs, they would say that
  effects of wCTEs applying to the same table as each other or the outer
  statement should be merged into one transition table. We weren't doing it
  like that.

  Hence, arrange to merge tuples from multiple update actions into a single
  transition table as much as we can. There is a problem, which is that if the
  firing of FK enforcement triggers and after-row triggers with transition
  tables is interspersed, we might need to report more tuples after some
  triggers have already seen the transition table. It seems like a bad idea for
  the transition table to be mutable between trigger calls. There's no good way
  around this without a major redesign of the FK logic, so for now, resolve it
  by opening a new transition table each time this happens.

  Also, ensure that AFTER STATEMENT triggers fire just once per statement, or
  once per transition table when we're forced to make more than one. Previous
  versions of Postgres have allowed each FK enforcement query to cause an
  additional firing of the AFTER STATEMENT triggers for the referencing table,
  but that's certainly not per spec. (We're still doing multiple firings of
  BEFORE STATEMENT triggers, though; is that something worth changing?)

  Also, forbid using transition tables with column-specific UPDATE triggers.
  The spec requires such transition tables to show only the tuples for which
  the UPDATE trigger would have fired, which means maintaining multiple
  transition tables or else somehow filtering the contents at readout. Maybe
  someday we'll bother to support that option, but it looks like a lot of
  trouble for a marginal feature.

  The transition tables are now managed by the AfterTriggers data structures,
  rather than being directly the responsibility of ModifyTable nodes. This
  removes a subtransaction-lifespan memory leak introduced by my previous
  band-aid patch 3c4359521.

  In passing, refactor the AfterTriggers data structures to reduce the
  management overhead for them, by using arrays of structs rather than several
  parallel arrays for per-query-level and per-subtransaction state.

  I failed to resist the temptation to do some copy-editing on the SGML docs
  about triggers, above and beyond merely documenting the effects of this patch.

  Back-patch to v10, because we don't want the semantics of transition tables
  to change post-release.

  Patch by me, with help and review from Thomas Munro.

  Discussion: https://postgr.es/m/20170909064853.25630.12825@wrigleys.postgresql.org

* Avoid duplicate typedef for SharedRecordTypmodRegistry. (Tom Lane, 2017-09-15)

  This isn't our usual solution for such problems, and older compilers (not
  terribly old, either) don't like it. Per buildfarm and local testing.

* Remove TupleDesc remapping logic from tqueue.c. (Andres Freund, 2017-09-14)

  With the introduction of a shared memory record typmod registry, it is no
  longer necessary to remap record typmods when sending tuples between backends
  so most of tqueue.c can be removed.

  Author: Thomas Munro
  Reviewed-By: Andres Freund
  Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com

* Add support for coordinating record typmods among parallel workers. (Andres Freund, 2017-09-14)

  Tuples can have type RECORDOID and a typmod number that identifies a blessed
  TupleDesc in a backend-private cache. To support the sharing of such tuples
  through shared memory and temporary files, provide a typmod registry in shared
  memory.

  To achieve that, introduce per-session DSM segments, created on demand when a
  backend first runs a parallel query. The per-session DSM segment has a
  table-of-contents just like the per-query DSM segment, and initially the
  contents are a shared record typmod registry and a DSA area to provide the
  space it needs to grow.

  State relating to the current session is accessed via a Session object reached
  through global variable CurrentSession that may require significant redesign
  further down the road as we figure out what else needs to be shared or
  remodelled.

  Author: Thomas Munro
  Reviewed-By: Andres Freund
  Discussion: https://postgr.es/m/CAEepm=0ZtQ-SpsgCyzzYpsXS6e=kZWqk3g5Ygn3MDV7A8dabUA@mail.gmail.com

* Perform only one ReadControlFile() during startup. (Andres Freund, 2017-09-14)

  Previously we read the control file in multiple places. But soon the segment
  size will be configurable and stored in the control file, and that needs to be
  available earlier than it currently is needed.

  Instead of adding yet another place where it's read, refactor things so
  there's a single processing of the control file during startup (in
  EXEC_BACKEND that's every individual backend's startup).

  Author: Andres Freund
  Discussion: http://postgr.es/m/20170913092828.aozd3gvvmw67gmyc@alap3.anarazel.de

* Expand partitioned table RTEs level by level, without flattening. (Robert Haas, 2017-09-14)

  Flattening the partitioning hierarchy at this stage makes various desirable
  optimizations difficult. The original use case for this patch was
  partition-wise join, which wants to match up the partitions in one
  partitioning hierarchy with those in another such hierarchy. However, it now
  seems that it will also be useful in making partition pruning work using the
  PartitionDesc rather than constraint exclusion, because with a flattened
  expansion, we have no easy way to figure out which PartitionDescs apply to
  which leaf tables in a multi-level partition hierarchy.

  As it turns out, we end up creating both rte->inh and !rte->inh RTEs for each
  intermediate partitioned table, just as we previously did for the root table.
  This seems unnecessary since the partitioned tables have no storage and are
  not scanned. We might want to go back and rejigger things so that no
  partitioned tables (including the parent) need !rte->inh RTEs, but that seems
  to require some adjustments not related to the core purpose of this patch.

  Ashutosh Bapat, reviewed by me and by Amit Langote. Some final adjustments by
  me.

  Discussion: http://postgr.es/m/CAFjFpRd=1venqLL7oGU=C1dEkuvk2DJgvF+7uKbnPHaum1mvHQ@mail.gmail.com

* Remove BoolPtr type (Peter Eisentraut, 2017-09-14)

  Not used and doesn't seem useful.

  Reviewed-by: Thomas Munro <thomas.munro@enterprisedb.com>

* Fix ordering in pg_dump of GRANTs (Stephen Frost, 2017-09-13)

  The order in which GRANTs are output is important as GRANTs which have been
  GRANT'd by individuals via WITH GRANT OPTION GRANTs have to come after the
  GRANT which included the WITH GRANT OPTION. This happens naturally in the
  backend during normal operation as we only change existing ACLs in-place, only
  add new ACLs to the end, and when removing an ACL we remove any which depend
  on it also.

  Also, adjust the comments in acl.h to make this clear.

  Unfortunately, the updates to pg_dump to handle initial privileges involved
  pulling apart ACLs and then combining them back together and could end up
  putting them back together in an invalid order, leading to dumps which
  wouldn't restore.

  Fix this by adjusting the queries used by pg_dump to ensure that the ACLs are
  rebuilt in the same order in which they were originally.

  Back-patch to 9.6 where the changes for initial privileges were done.

* Distinguish selectivity of < from <= and > from >=. (Tom Lane, 2017-09-13)

  Historically, the selectivity functions have simply not distinguished < from
  <=, or > from >=, arguing that the fraction of the population that satisfies
  the "=" aspect can be considered to be vanishingly small, if the comparison
  value isn't any of the most-common-values for the variable. (If it is, the
  code path that executes the operator against each MCV will take care of things
  properly.) But that isn't really true unless we're dealing with a continuum of
  variable values, and in practice we seldom are. If "x = const" would estimate
  a nonzero number of rows for a given const value, then it follows that we
  ought to estimate different numbers of rows for "x < const" and "x <= const",
  even if the const is not one of the MCVs. Handling this more honestly makes a
  significant difference in edge cases, such as the estimate for a tight range
  (x BETWEEN y AND z where y and z are close together).

  Hence, split scalarltsel into scalarltsel/scalarlesel, and similarly split
  scalargtsel into scalargtsel/scalargesel. Adjust <= and >= operator
  definitions to reference the new selectivity functions. Improve the core
  ineq_histogram_selectivity() function to make a correction for equality.
  (Along the way, I learned quite a bit about exactly why that function gives
  good answers, which I tried to memorialize in improved comments.)

  The corresponding join selectivity functions were, and remain, just stubs.
  But I chose to split them similarly, to avoid confusion and to prevent the
  need for doing this exercise again if someone ever makes them less stubby.

  In passing, change ineq_histogram_selectivity's clamp for extreme probability
  estimates so that it varies depending on the histogram size, instead of being
  hardwired at 0.0001. With the default histogram size of 100 entries, you
  still get the old clamp value, but bigger histograms should allow us to put
  more faith in edge values.

  Tom Lane, reviewed by Aleksander Alekseev and Kuntal Ghosh

  Discussion: https://postgr.es/m/12232.1499140410@sss.pgh.pa.us

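The correction being added is, at bottom, just additivity over the point probability at the comparison constant; schematically, for a non-MCV constant c (this states the idea, not the exact ineq_histogram_selectivity() arithmetic):

    \mathrm{sel}(x \le c) \;=\; \mathrm{sel}(x < c) + \mathrm{sel}(x = c)
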
* Introduce BYTES unit for GUCs. (Andres Freund, 2017-09-12)

  This is already useful for track_activity_query_size, and will further be used
  in a later commit making the WAL segment size configurable.

  Author: Beena Emerson
  Reviewed-By: Andres Freund
  Discussion: https://postgr.es/m/CAOG9ApEu8bXVwBxkOO9J7ZpM76TASK_vFMEEiCEjwhMmSLiaqQ@mail.gmail.com

* Allow custom search filters to be configured for LDAP auth (Peter Eisentraut, 2017-09-12)

  Before, only filters of the form "(<ldapsearchattribute>=<user>)" could be
  used to search an LDAP server. Introduce ldapsearchfilter so that more general
  filters can be configured using patterns, like "(|(uid=$username)(mail=$username))"
  and "(&(uid=$username)(objectClass=posixAccount))". Also allow search filters
  to be included in an LDAP URL.

  Author: Thomas Munro
  Reviewed-By: Peter Eisentraut, Mark Cave-Ayland, Magnus Hagander
  Discussion: https://postgr.es/m/CAEepm=0XTkYvMci0WRubZcf_1am8=gP=7oJErpsUfRYcKF2gwg@mail.gmail.com

* Message style fixes (Peter Eisentraut, 2017-09-11)

* Remove pre-order and post-order traversal logic for red-black trees. (Tom Lane, 2017-09-10)

  This code isn't used, and there's no clear reason why anybody would ever want
  to use it. These traversal mechanisms don't yield a visitation order that is
  semantically meaningful for any external purpose, nor are they any faster or
  simpler than the left-to-right or right-to-left traversals. (In fact, some
  rough testing suggests they are slower :-(.) Moreover, these mechanisms are
  impossible to test in any arm's-length fashion; doing so requires knowledge of
  the red-black tree's internal implementation. Hence, let's just jettison them.

  Discussion: https://postgr.es/m/17735.1505003111@sss.pgh.pa.us

* Allow a partitioned table to have a default partition. (Robert Haas, 2017-09-08)

  Any tuples that don't route to any other partition will route to the default
  partition.

  Jeevan Ladhe, Beena Emerson, Ashutosh Bapat, Rahila Syed, and Robert Haas,
  with review and testing at various stages by (at least) Rushabh Lathia, Keith
  Fiske, Amit Langote, Amul Sul, Rajkumar Raghuanshi, Sven Kunze, Kyotaro
  Horiguchi, Thom Brown, Rafia Sabih, and Dilip Kumar.

  Discussion: http://postgr.es/m/CAH2L28tbN4SYyhS7YV1YBWcitkqbhSWfQCy0G=apRcC_PEO-bg@mail.gmail.com
  Discussion: http://postgr.es/m/CAOG9ApEYj34fWMcvBMBQ-YtqR9fTdXhdN82QEKG0SVZ6zeL1xg@mail.gmail.com
