aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access
Commit message (Collapse)AuthorAge
...
* Instrument freezing in autovacuum log reports.Peter Geoghegan2022-09-08
| | | | | | | | | | | Add a new line to log reports from autovacuum (as well as VACUUM VERBOSE output) that shows information about freezing. Emphasis is placed on the total number of heap pages that had one or more tuples frozen by VACUUM. The total number of tuples frozen is also shown. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Jeff Janes <jeff.janes@gmail.com> Discussion: https://postgr.es/m/CAH2-WznTY6D0zyE8VLrC6Gd4kh_HGAXxnTPtcOQOOsxzLx9zog@mail.gmail.com
* Fix recovery_prefetch with low maintenance_io_concurrency.Thomas Munro2022-09-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | We should process completed IOs *before* trying to start more, so that it is always possible to decode one more record when the decoded record queue is empty, even if maintenance_io_concurrency is set so low that a single earlier WAL record might have saturated the IO queue. That bug was hidden because the effect of maintenance_io_concurrency was arbitrarily clamped to be at least 2. Fix the ordering, and also remove that clamp. We need a special case for 0, which is now treated the same as recovery_prefetch=off, but otherwise the number is used directly. This allows for testing with 1, which would have made the problem obvious in simple test scenarios. Also add an explicit error message for missing contrecords. It was a bit strange that we didn't report an error already, and became a latent bug with prefetching, since the internal state that tracks aborted contrecords would not survive retrying, as revealed by 026_overwrite_contrecord.pl with this adjustment. Reporting an error prevents that. Back-patch to 15. Reported-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20220831140128.GS31833%40telsasoft.com
* Fix an assortment of improper usages of string functionsDavid Rowley2022-09-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | In a similar effort to f736e188c and 110d81728, fixup various usages of string functions where a more appropriate function is available and more fit for purpose. These changes include: 1. Use cstring_to_text_with_len() instead of cstring_to_text() when working with a StringInfoData and the length can easily be obtained. 2. Use appendStringInfoString() instead of appendStringInfo() when no formatting is required. 3. Use pstrdup(...) instead of psprintf("%s", ...) 4. Use pstrdup(...) instead of psprintf(...) (with no formatting) 5. Use appendPQExpBufferChar() instead of appendPQExpBufferStr() when the length of the string being appended is 1. 6. appendStringInfoChar() instead of appendStringInfo() when no formatting is required and string is 1 char long. 7. Use appendPQExpBufferStr(b, .) instead of appendPQExpBuffer(b, "%s", .) 8. Don't use pstrdup when it's fine to just point to the string constant. I (David) did find other cases of #8 but opted to use #4 instead as I wasn't certain enough that applying #8 was ok (e.g in hba.c) Author: Ranier Vilela, David Rowley Discussion: https://postgr.es/m/CAApHDvo2j2+RJBGhNtUz6BxabWWh2Jx16wMUMWKUjv70Ver1vg@mail.gmail.com
* Fix incorrect uses of Datum conversion macrosPeter Eisentraut2022-09-05
| | | | | | | | | | | Since these macros just cast whatever you give them to the designated output type, and many normal uses also cast the output type further, a number of incorrect uses go undiscovered. The fixes in this patch have been discovered by changing these macros to inline functions, which is the subject of a future patch. Reviewed-by: Aleksander Alekseev <aleksander@timescale.com> Discussion: https://www.postgresql.org/message-id/flat/8528fb7e-0aa2-6b54-85fb-0c0886dbd6ed%40enterprisedb.com
* Fix cache invalidation bug in recovery_prefetch.Thomas Munro2022-09-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XLogPageRead() can retry internally after a pread() system call has succeeded, in the case of short reads, and page validation failures while in standby mode (see commit 0668719801). Due to an oversight in commit 3f1ce973, these cases could leave stale data in the internal cache of xlogreader.c without marking it invalid. The main defense against stale cached data on failure to read a page was in the error handling path of the calling function ReadPageInternal(), but that wasn't quite enough for errors handled internally by XLogPageRead()'s retry loop if we then exited with XLREAD_WOULDBLOCK. 1. ReadPageInternal() now marks the cache invalid before calling the page_read callback, by setting state->readLen to 0. It'll be set to a non-zero value only after a successful read. It'll stay valid as long as the caller requests data in the cached range. 2. XLogPageRead() no long performs internal retries while reading ahead. While such retries should work, the general philosophy is that we should give up prefetching if anything unusual happens so we can handle it when recovery catches up, to reduce the complexity of the system. Let's do that here too. 3. While here, a new function XLogReaderResetError() improves the separation between xlogrecovery.c and xlogreader.c, where the former previously clobbered the latter's internal error buffer directly. The new function makes this more explicit, and also clears a related flag, without which a standby would needlessly retry in the outer function. Thanks to Noah Misch for tracking down the conditions required for a rare build farm failure in src/bin/pg_ctl/t/003_promote.pl, and providing a reproducer. Back-patch to 15. Reported-by: Noah Misch <noah@leadboat.com> Discussion: https://postgr.es/m/20220807003627.GA4168930%40rfd.leadboat.com
* Expand the use of get_dirent_type(), shaving a few calls to stat()/lstat()Michael Paquier2022-09-02
| | | | | | | | | | | | | | | | | | | Several backend-side loops scanning one or more directories with ReadDir() (WAL segment recycle/removal in xlog.c, backend-side directory copy, temporary file removal, configuration file parsing, some logical decoding logic and some pgtz stuff) already know the type of the entry being scanned thanks to the dirent structure associated to the entry, on platforms where we know about DT_REG, DT_DIR and DT_LNK to make the difference between a regular file, a directory and a symbolic link. Relying on the direct structure of an entry saves a few system calls to stat() and lstat() in the loops updated here, shaving some code while on it. The logic of the code remains the same, calling stat() or lstat() depending on if it is necessary to look through symlinks. Authors: Nathan Bossart, Bharath Rupireddy Reviewed-by: Andres Freund, Thomas Munro, Michael Paquier Discussion: https://postgr.es/m/CALj2ACV8n-J-f=yiLUOx2=HrQGPSOZM3nWzyQQvLPcccPXxEdg@mail.gmail.com
* Derive freeze cutoff from nextXID, not OldestXmin.Peter Geoghegan2022-08-31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before now, the cutoffs that VACUUM used to determine which XIDs/MXIDs to freeze were determined at the start of each VACUUM by taking related cutoffs that represent which XIDs/MXIDs VACUUM should treat as still running, and subtracting an XID/MXID age based value controlled by GUCs like vacuum_freeze_min_age. The FreezeLimit cutoff (XID freeze cutoff) was derived by subtracting an XID age value from OldestXmin, while the MultiXactCutoff cutoff (MXID freeze cutoff) was derived by subtracting an MXID age value from OldestMxact. This approach didn't match the approach used nearby to determine whether this VACUUM operation should be an aggressive VACUUM or not. VACUUM now uses the standard approach instead: it subtracts the same age-based values from next XID/next MXID (rather than subtracting from OldestXmin/OldestMxact). This approach is simpler and more uniform. Most of the time it will have only a negligible impact on how and when VACUUM freezes. It will occasionally make VACUUM more robust in the event of problems caused by long running transaction. These are cases where OldestXmin and OldestMxact are held back by so much that they attain an age that is a significant fraction of the value of age-based settings like vacuum_freeze_min_age. There is no principled reason why freezing should be affected in any way by the presence of a long-running transaction -- at least not before the point that the OldestXmin and OldestMxact limits used by each VACUUM operation attain an age that makes it unsafe to freeze some of the XIDs/MXIDs whose age exceeds the value of the relevant age-based settings. The new approach should at least make freezing degrade more gracefully than before, even in the most extreme cases. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Nathan Bossart <nathandbossart@gmail.com> Reviewed-By: Matthias van de Meent <boekewurm+postgres@gmail.com> Discussion: https://postgr.es/m/CAH2-WzkOv5CEeyOO=c91XnT5WBR_0gii0Wn5UbZhJ=4TTykDYg@mail.gmail.com
* Adjust comments that called MultiXactIds "XMIDs".Peter Geoghegan2022-08-29
| | | | Oversights in commits 0b018fab and f3c15cbe.
* Clean up inconsistent use of fflush().Tom Lane2022-08-29
| | | | | | | | | | | | | | | | | | | | | | More than twenty years ago (79fcde48b), we hacked the postmaster to avoid a core-dump on systems that didn't support fflush(NULL). We've mostly, though not completely, hewed to that rule ever since. But such systems are surely gone in the wild, so in the spirit of cleaning out no-longer-needed portability hacks let's get rid of multiple per-file fflush() calls in favor of using fflush(NULL). Also, we were fairly inconsistent about whether to fflush() before popen() and system() calls. While we've received no bug reports about that, it seems likely that at least some of these call sites are at risk of odd behavior, such as error messages appearing in an unexpected order. Rather than expend a lot of brain cells figuring out which places are at hazard, let's just establish a uniform coding rule that we should fflush(NULL) before these calls. A no-op fflush() is surely of trivial cost compared to launching a sub-process via a shell; while if it's not a no-op then we likely need it. Discussion: https://postgr.es/m/2923412.1661722825@sss.pgh.pa.us
* Prevent WAL corruption after a standby promotion.Robert Haas2022-08-29
| | | | | | | | | | | | | | | | | | | | | | | | | | | When a PostgreSQL instance performing archive recovery but not using standby mode is promoted, and the last WAL segment that it attempted to read ended in a partial record, the previous code would create invalid WAL on the new timeline. The WAL from the previously timeline would be copied to the new timeline up until the end of the last valid record, but instead of beginning to write WAL at immediately afterwards, the promoted server would write an overwrite contrecord at the beginning of the next segment. The end of the previous segment would be left as all-zeroes, resulting in failures if anything tried to read WAL from that file. The root of the issue is that ReadRecord() decides whether to set abortedRecPtr and missingContrecPtr based on the value of StandbyMode, but ReadRecord() switches to a new timeline based on the value of ArchiveRecoveryRequested. We shouldn't try to write an overwrite contrecord if we're switching to a new timeline, so change the test in ReadRecod() to check ArchiveRecoveryRequested instead. Code fix by Dilip Kumar. Comments by me incorporating suggested language from Álvaro Herrera. Further review from Kyotaro Horiguchi and Sami Imseih. Discussion: http://postgr.es/m/CAFiTN-t7umki=PK8dT1tcPV=mOUe2vNhHML6b3T7W7qqvvajjg@mail.gmail.com Discussion: http://postgr.es/m/FB0DEA0B-E14E-43A0-811F-C1AE93D00FF3%40amazon.com
* Small refactor to get rid of -Wshadow=compatible-local warningDavid Rowley2022-08-26
| | | | | | | | | | Further reduce -Wshadow=compatible-local warnings by 1 by refactoring the code in gistRelocateBuildBuffersOnSplit() to make use of foreach_current_index() instead of manually incrementing a variable on each loop. Author: David Rowley Discussion: https://postgr.es/m/CAApHDvpGZX-X=Bn4moyXgfFa0CdSUwoa04d3isit3=1qo8F8Bw@mail.gmail.com
* More -Wshadow=compatible-local warning fixesDavid Rowley2022-08-26
| | | | | | | | | | | | In a similar effort to f01592f91, here we're targetting fixing the warnings where we've deemed the shadowing variable to serve a close enough purpose to the shadowed variable just to reuse the shadowed version and not declare the shadowing variable at all. By my count, this takes the warning count from 106 down to 71. Author: Justin Pryzby Discussion: https://postgr.es/m/20220825020839.GT2342@telsasoft.com
* Further -Wshadow=compatible-local warning fixesDavid Rowley2022-08-24
| | | | | | | | | | | | | These should have been included in 421892a19 as these shadowed variable warnings can also be fixed by adjusting the scope of the shadowed variable to put the declaration for it in an inner scope. This is part of the same effort as f01592f91. By my count, this takes the warning count from 114 down to 106. Author: David Rowley and Justin Pryzby Discussion: https://postgr.es/m/CAApHDvrwLGBP%2BYw9vriayyf%3DXR4uPWP5jr6cQhP9au_kaDUhbA%40mail.gmail.com
* Allow parallel workers to retrieve some data from PortMichael Paquier2022-08-24
| | | | | | | | | | | | | | | | | | | | | | | | | This commit moves authn_id into a new global structure called ClientConnectionInfo (mapping to a MyClientConnectionInfo for each backend) which is intended to hold all the client information that should be shared between the backend and any of its parallel workers, access for extensions and triggers being the primary use case. There is no need to push all the data of Port to the workers, and authn_id is quite a generic concept so using a separate structure provides the best balance (the name of the structure has been suggested by Robert Haas). While on it, and per discussion as this would be useful for a potential SYSTEM_USER that can be accessed through parallel workers, a second field is added for the authentication method, copied directly from Port. ClientConnectionInfo is serialized and restored using a new parallel key and a structure tracks the length of the authn_id, making the addition of more fields straight-forward. Author: Jacob Champion Reviewed-by: Bertrand Drouvot, Stephen Frost, Robert Haas, Tom Lane, Michael Paquier, Julien Rouhaud Discussion: https://postgr.es/m/793d990837ae5c06a558d58d62de9378ab525d83.camel@vmware.com
* Further reduce warnings with -Wshadow=compatible-localDavid Rowley2022-08-24
| | | | | | | | | | | | | | | | | | | | | | | In a similar effort to f01592f91, here we're targetting fixing the warnings that -Wshadow=compatible-local produces that we can fix by moving a variable to an inner scope to stop that variable from being shadowed by another variable declared somewhere later in the function. All of the warnings being fixed here are changing the scope of variables which are being used as an iterator for a "for" loop. In each instance, the fix happens to be changing the for loop to use the C99 type initialization. Much of this code likely pre-dates our use of C99. Reducing the scope of the outer scoped variable seems like the safest way to fix these. Renaming seems more likely to risk patches using the wrong variable. Reducing the scope is more likely to result in a compilation failure after applying some future patch rather than introducing bugs with it. By my count, this takes the warning count from 129 down to 114. Author: Justin Pryzby Discussion: https://postgr.es/m/CAApHDvrwLGBP%2BYw9vriayyf%3DXR4uPWP5jr6cQhP9au_kaDUhbA%40mail.gmail.com
* Remove empty statementJohn Naylor2022-08-23
| | | | | | Peter Smith Discussion: https://www.postgresql.org/message-id/CAHut%2BPtRGVuj8Q_GpHHxZyk7fGwdYDG8_s4GSfKoc_4Yd9vR-w%40mail.gmail.com
* Adjust assertion in XLogDecodeNextRecord.Robert Haas2022-08-18
| | | | | | | | | | | | | | | | | As written, if you use XLogBeginRead() to position an xlogreader at the beginning of a WAL page and then try to read WAL, this assertion will fail. However, the header comment for XLogBeginRead() claims that positioning an xlogreader at the beginning of a page is valid, and the code here is perfectly able to cope with it. It's only the assertion that causes trouble. So relax it. This is formally a bug in all supported branches, but as it doesn't seem to have any consequences for current uses of the xlogreader facility, no back-patch, at least for now. Dilip Kumar and Robert Haas Discussion: http://postgr.es/m/CA+TgmoaJSs2_7WHW2GzFYe9+zfPtxBKvT3GW47+x=ptUE=cULw@mail.gmail.com
* Use SetInstallXLogFileSegmentActive() in more places in xlog.cMichael Paquier2022-08-17
| | | | | | | | | | This reduces the code paths where XLogCtl->InstallXLogFileSegmentActive is directly touched, and this wrapper function does the same thing as the original code replaced by the function call. Author: Bharath Rupireddy Reviewed-by: Nathan Bossart Discussion: https://postgr.es/m/CALj2ACVhkf-bC5CX-=6iBUfkO5GqmBntQH+m=HpY0iQ=-g1pRg@mail.gmail.com
* Fix assert in logicalmsg_descTomas Vondra2022-08-16
| | | | | | | | | | | The assert, introduced by 9f1cf97bb5, is intended to check if the prefix is terminated by a \0 byte, but it has two flaws. Firstly, prefix_size includes the \0 byte, so prefix[prefix_size] points to the byte after the null byte. Secondly, the check ensures the byte is not equal \0, while it should be checking the opposite. Backpatch-through: 14 Discussion: https://postgr.es/m/b99b6101-2f14-3796-3dfa-4a6cd7d4326d@enterprisedb.com
* Move basebackup code to new directory src/backend/backupRobert Haas2022-08-10
| | | | | | Reviewed by David Steele and Justin Pryzby Discussion: http://postgr.es/m/CA+TgmoafqboATDSoXHz8VLrSwK_MDhjthK4hEpYjqf9_1Fmczw%40mail.gmail.com
* Fix obsolete comment in commit_ts.c.Thomas Munro2022-08-09
| | | | | | | | Commit 08aa89b removed COMMIT_TS_SETTS, but left a reference in a comment. Author: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://postgr.es/m/20220726173343.GA154110%40nathanxps13
* Fix comments about deduplication updating page.Peter Geoghegan2022-08-05
| | | | | | | | | | | nbtree deduplication passes add tuples from the original/target page to a temp page, merging as necessary. The temp page is copied back to the target permanent page in the critical section. This is similar to the approach taken by nbtree page splits. Adjust comments that referred to updating the original page in-place as tuples were merged. These were left over from earlier versions of the deduplication patch that didn't yet use a temp page.
* Add missing parenthesis to max item size macro.Peter Geoghegan2022-08-05
| | | | Oversight in commit 92f37505, per buildfarm.
* BRIN: mask BRIN_EVACUATE_PAGE for WAL consistency checkingAlvaro Herrera2022-08-05
| | | | | | | | | | | | | | | | That bit is unlogged and therefore it's wrong to consider it in WAL page comparison. Add a test that tickles the case, as branch testing technology allows. This has been a problem ever since wal consistency checking was introduced (commit a507b86900f6 for pg10), so backpatch to all supported branches. Author: 王海洋 (Haiyang Wang) <wanghaiyang.001@bytedance.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/CACciXAD2UvLMOhc4jX9VvOKt7DtYLr3OYRBhvOZ-jRxtzc_7Jg@mail.gmail.com Discussion: https://postgr.es/m/CACciXADOfErX9Bx0nzE_SkdfXr6Bbpo5R=v_B6MUTEYW4ya+cg@mail.gmail.com
* Remove configure probe for fdatasync.Thomas Munro2022-08-05
| | | | | | | | | | | | | | | | | fdatasync() is in SUSv2, and all targeted Unix systems have it. We have a replacement function for Windows. We retain the probe for the function declaration, which allows us to supply the mysteriously missing declaration for macOS, and also for Windows. No need to keep a HAVE_FDATASYNC macro around. Also rename src/port/fdatasync.c to win32fdatasync.c since it's only for Windows. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com Discussion: https://postgr.es/m/CA%2BhUKGJZJVO%3DiX%2Beb-PXi2_XS9ZRqnn_4URh0NUQOwt6-_51xQ%40mail.gmail.com
* Remove dead pread and pwrite replacement code.Thomas Munro2022-08-05
| | | | | | | | | | | | | | | | | | | | | | | pread() and pwrite() are in SUSv2, and all targeted Unix systems have them. Previously, we defined pg_pread and pg_pwrite to emulate these function with lseek() on old Unixen. The names with a pg_ prefix were a reminder of a portability hazard: they might change the current file position. That hazard is gone, so we can drop the prefixes. Since the remaining replacement code is Windows-only, move it into src/port/win32p{read,write}.c, and move the declarations into src/include/port/win32_port.h. No need for vestigial HAVE_PREAD, HAVE_PWRITE macros as they were only used for declarations in port.h which have now moved into win32_port.h. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Greg Stark <stark@mit.edu> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
* Remove configure probes for symlink/readlink, and dead code.Thomas Munro2022-08-05
| | | | | | | | | | | | | | | | | | | symlink() and readlink() are in SUSv2 and all targeted Unix systems have them. We have partial emulation on Windows. Code that raised runtime errors on systems without it has been dead for years, so we can remove that and also references to such systems in the documentation. Define HAVE_READLINK and HAVE_SYMLINK macros on Unix. Our Windows replacement functions based on junction points can't be used for relative paths or for non-directories, so the macros can be used to check for full symlink support. The places that deal with tablespaces can just use symlink functions without checking the macros. (If they did check the macros, they'd need to provide an #else branch with a runtime or compile time error, and it'd be dead code.) Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA+hUKGJ3LHeP9w5Fgzdr4G8AnEtJ=z=p6hGDEm4qYGEUX5B6fQ@mail.gmail.com
* Rephrase comments to make them clearerDaniel Gustafsson2022-08-04
| | | | | | | | | | | | The use of "we" when referring to the active backend might be misunderstood, so rephrase to make it clearer who is performing the actions discussed in the comment. Author: Junwang Zhao <zhjwpku@gmail.com> Reviewed-by: Erikjan Rijkers <er@xs4all.nl> Reviewed-by: Robert Treat <rob@xzilla.net> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CAEG8a3LRSMqkvjiURiJoSi4aGWORpiXUmUfQQK5PaD6WfPzu3w@mail.gmail.com
* Improve speed of hash index build.Tom Lane2022-07-28
| | | | | | | | | | | | | In the initial data sort, if the bucket numbers are the same then next sort on the hash value. Because index pages are kept in hash value order, this gains a little speed by allowing the eventual tuple insertions to be done sequentially, avoiding repeated data movement within PageAddItem. This seems to be good for overall speedup of 5%-9%, depending on the incoming data. Simon Riggs, reviewed by Amit Kapila Discussion: https://postgr.es/m/CANbhV-FG-1ZNMBuwhUF7AxxJz3u5137dYL-o6hchK1V_dMw86g@mail.gmail.com
* Fix replay of create database records on standbyAlvaro Herrera2022-07-28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Crash recovery on standby may encounter missing directories when replaying database-creation WAL records. Prior to this patch, the standby would fail to recover in such a case; however, the directories could be legitimately missing. Consider the following sequence of commands: CREATE DATABASE DROP DATABASE DROP TABLESPACE If, after replaying the last WAL record and removing the tablespace directory, the standby crashes and has to replay the create database record again, crash recovery must be able to continue. A fix for this problem was already attempted in 49d9cfc68bf4, but it was reverted because of design issues. This new version is based on Robert Haas' proposal: any missing tablespaces are created during recovery before reaching consistency. Tablespaces are created as real directories, and should be deleted by later replay. CheckRecoveryConsistency ensures they have disappeared. The problems detected by this new code are reported as PANIC, except when allow_in_place_tablespaces is set to ON, in which case they are WARNING. Apart from making tests possible, this gives users an escape hatch in case things don't go as planned. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Author: Asim R Praveen <apraveen@pivotal.io> Author: Paul Guo <paulguo@gmail.com> Reviewed-by: Anastasia Lubennikova <lubennikovaav@gmail.com> (older versions) Reviewed-by: Fujii Masao <masao.fujii@oss.nttdata.com> (older versions) Reviewed-by: Michaël Paquier <michael@paquier.xyz> Diagnosed-by: Paul Guo <paulguo@gmail.com> Discussion: https://postgr.es/m/CAEET0ZGx9AvioViLf7nbR_8tH9-=27DN5xWJ2P9-ROH16e4JUA@mail.gmail.com
* Add overflow protection for block-related data in WAL recordsMichael Paquier2022-07-27
| | | | | | | | | | | | | | | | | | | | | | | | | | XLogRecordBlockHeader, the header holding the information for the data related to a block, tracks the length of the data appended to the WAL record with data_length (uint16). This limitation in size was not enforced by the public routine in charge of registering the data assembled later to form the WAL record inserted, XLogRegisterBufData(). Incorrectly used, it could lead to the generation of records with some of its data overflowed. This commit adds some safeguards to prevent that for the block data, complaining immediately if attempting to add to a record block information with a size larger than UINT16_MAX, which is the limit implied by the internal logic. Note that this also adjusts XLogRegisterData() and XLogRegisterBufData() so as the length of the WAL record data given by the caller is unsigned, matching with what gets stored in XLogRecData->len. Extracted from a larger patch by the same author. The original patch includes more protections when assembling a record in full that will be looked at separately later. Author: Matthias van de Meent Reviewed-by: Andres Freund, Heikki Linnakangas, Michael Paquier, David Zhang Discussion: https://postgr.es/m/CAEze2WgGiw+LZt+vHf8tWqB_6VxeLsMeoAuod0N=ij1q17n5pw@mail.gmail.com
* Force immediate commit after CREATE DATABASE etc in extended protocol.Tom Lane2022-07-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a few commands that "can't run in a transaction block", meaning that if they complete their processing but then we fail to COMMIT, we'll be left with inconsistent on-disk state. However, the existing defenses for this are only watertight for simple query protocol. In extended protocol, we didn't commit until receiving a Sync message. Since the client is allowed to issue another command instead of Sync, we're in trouble if that command fails or is an explicit ROLLBACK. In any case, sitting in an inconsistent state while waiting for a client message that might not come seems pretty risky. This case wasn't reachable via libpq before we introduced pipeline mode, but it's always been an intended aspect of extended query protocol, and likely there are other clients that could reach it before. To fix, set a flag in PreventInTransactionBlock that tells exec_execute_message to force an immediate commit. This seems to be the approach that does least damage to existing working cases while still preventing the undesirable outcomes. While here, add some documentation to protocol.sgml that explicitly says how to use pipelining. That's latent in the existing docs if you know what to look for, but it's better to spell it out; and it provides a place to document this new behavior. Per bug #17434 from Yugo Nagata. It's been wrong for ages, so back-patch to all supported branches. Discussion: https://postgr.es/m/17434-d9f7a064ce2a88a3@postgresql.org
* Eliminate duplicate code in table.c.Amit Kapila2022-07-26
| | | | | | | | | Additionally improve the error message similar to how it was done in 2ed532ee8c. Author: Junwang Zhao, Aleksander Alekseev Reviewed-by: Amit Kapila, Alvaro Herrera, Kyotaro Horiguchi Discussion: https://postgr.es/m/CAEG8a3KbVtBm_BYf5tGsKHvmMieQVsq_jBPOg75VViQB7ACL8Q%40mail.gmail.com
* Remove useless arguments in ReadCheckpointRecord().Fujii Masao2022-07-25
| | | | | | | | | | | | | | | | | | | | | | | | | This commit removes two arguments "report" and "whichChkpt" in ReadCheckpointRecord(). "report" is obviously useless because it's always true, i.e., there are two callers of the function and they always specify true as "report". Commit 1d919de5eb removed the only call with "report" = false. "whichChkpt" indicated where the specified checkpoint location came from, pg_control or backup_label. This information was used to report different error messages depending on where the invalid checkpoint record came from, when it was found. But ReadCheckpointRecord() doesn't need to do that because its callers already do that and users can still identify where the invalid checkpoint record came from, by reading such log messages. Also when "whichChkpt" was 0, the word "primary checkpoint" was used in the log message and could confuse users because the concept of primary and secondary checkpoints was already removed before. These are why this commit removes "whichChkpt" argument. Author: Fujii Masao Reviewed-by: Bharath Rupireddy, Kyotaro Horiguchi Discussion: https://postgr.es/m/fa2e12eb-81c3-0717-0272-755f8a81c8f2@oss.nttdata.com
* Remove unnecessary Windows-specific basebackup code.Thomas Munro2022-07-22
| | | | | | | | | | | | | | | | | Commit c6f2f016 added an explicit check for a Windows "junction point". That turned out to be needed only because get_dirent_type() was busted on Windows. It's been fixed by commit 9d3444dc, so remove it. Add a TAP-test to demonstrate that in-place tablespaces are copied by pg_basebackup. This exercises the codepath that would fail before c6f2f016 on Windows, and shows that it still doesn't fail now that we're using get_dirent_type() on both Windows and Unix. Back-patch to 15, where in-place tablespaces arrived and caused this problem (ie directories where previously only symlinks were expected). Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGLzLK4PUPx0_AwXEWXOYAejU%3D7XpxnYE55Y%2Be7hB2N3FA%40mail.gmail.com
* Remove O_FSYNC and associated macros.Thomas Munro2022-07-22
| | | | | | | | | | | | | | | | | | O_FSYNC was a pre-POSIX way of spelling O_SYNC, supported since commit 9d645fd84c3 for non-conforming operating systems of the time. It's not needed on any modern system. We can just use standard O_SYNC directly if it exists (= all targeted systems except Windows), and get rid of our OPEN_SYNC_FLAG macro. Similarly for standard O_DSYNC, we can just use that directly if it exists (= all targeted systems except DragonFlyBSD), and get rid of our OPEN_DATASYNC_FLAG macro. We still avoid choosing open_datasync as a default value for wal_sync_method if O_DSYNC has the same value as O_SYNC (= only OpenBSD), so there is no change in default behavior. Discussion: https://postgr.es/m/CA%2BhUKGJE7y92NY7FG2ftUbZUaqohBU65_Ys_7xF5mUHo4wirTQ%40mail.gmail.com
* Remove fls(), use pg_leftmost_one_pos32() instead.Thomas Munro2022-07-22
| | | | | | | | | | | | | | Commit 4f658dc8 provided the traditional BSD fls() function in src/port/fls.c so it could be used in several places. Later we added a bunch of similar facilities in pg_bitutils.h, based on compiler builtins that map to hardware instructions. It's a bit confusing to have both 1-based and 0-based variants of this operation in use in different parts of the tree, and neither is blessed by a standard. Let's drop fls.c and the configure probe, and reuse the newer code. Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://postgr.es/m/CA%2BhUKG%2B7dSX1XF8yFGmYk-%3D48dbjH2kmzZj16XvhbrWP-9BzRg%40mail.gmail.com
* Fix assertion failure and segmentation fault in backup code.Fujii Masao2022-07-20
| | | | | | | | | | | | | | | | | | | | | | When a non-exclusive backup is canceled, do_pg_abort_backup() is called and resets some variables set by pg_backup_start (pg_start_backup in v14 or before). But previously it forgot to reset the session state indicating whether a non-exclusive backup is in progress or not in this session. This issue could cause an assertion failure when the session running BASE_BACKUP is terminated after it executed pg_backup_start and pg_backup_stop (pg_stop_backup in v14 or before). Also it could cause a segmentation fault when pg_backup_stop is called after BASE_BACKUP in the same session is canceled. This commit fixes the issue by making do_pg_abort_backup reset that session state. Back-patch to all supported branches. Author: Fujii Masao Reviewed-by: Kyotaro Horiguchi, Masahiko Sawada, Michael Paquier, Robert Haas Discussion: https://postgr.es/m/3374718f-9fbf-a950-6d66-d973e027f44c@oss.nttdata.com
* Replace many MemSet calls with struct initializationPeter Eisentraut2022-07-16
| | | | | | | | | | | | | | This replaces all MemSet() calls with struct initialization where that is easily and obviously possible. (For example, some cases have to worry about padding bits, so I left those.) (The same could be done with appropriate memset() calls, but this patch is part of an effort to phase out MemSet(), so it doesn't touch memset() calls.) Reviewed-by: Ranier Vilela <ranier.vf@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://www.postgresql.org/message-id/9847b13c-b785-f4e2-75c3-12ec77a3b05c@enterprisedb.com
* Add checkpoint and REDO LSN to log_checkpoints message.Fujii Masao2022-07-07
| | | | | | | | | | | | | It is useful for debugging purposes to report the checkpoint LSN and REDO LSN in log_checkpoints message. It can give more context while analyzing checkpoint-related issues. pg_controldata reports the last checkpoint LSN and REDO LSN, but having this information alongside the log message helps analyze issues that happened previously, connect the dots and identify the root cause. Author: Bharath Rupireddy, Kyotaro Horiguchi Reviewed-by: Michael Paquier, Julien Rouhaud, Nathan Bossart, Fujii Masao, Greg Stark Discussion: https://postgr.es/m/CALj2ACWt6kqriAHrO+AJj+OmP=suwbktHT5JoYAn-nqZe2gd2g@mail.gmail.com
* Overload index_form_tuple to allow the memory context to be suppliedDavid Rowley2022-07-07
| | | | | | | | | | | | | | | | | | | | | | | 40af10b57 changed things so we make use of a generation memory context for storing tuples to be sorted by tuplesort.c. That change does not play nicely with the changes made in 9f03ca915 (back in 2014). That commit changed things so that index_form_tuple() is called while switched into the tuplestore's tuplecontext. In order to fetch the tuple from the index, index_form_tuple() must do various memory allocations which are unrelated to the storage of the final returned tuple. Although all of these allocations are pfree'd, the fact that we now use a generation context means that the memory for these pfree'd allocations won't be used again by any other allocation due to generation.c's lack of freelists. This could result in sorts used for building indexes exceeding maintenance_work_mem by a very large amount. Here we fix it so we no longer allocate anything apart from the tuple itself into the generation context by adding a new version of index_form_tuple named index_form_tuple_context, which can be called to specify the MemoryContext to allocate the tuple into. Discussion: https://postgr.es/m/CAApHDvrHQkiFRHiGiAS-LMOvJN-eK-s762=tVzBz8ZqUea-a_A@mail.gmail.com Backpatch-through: 15, where 40af10b57 was added.
* Change internal RelFileNode references to RelFileNumber or RelFileLocator.Robert Haas2022-07-06
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have been using the term RelFileNode to refer to either (1) the integer that is used to name the sequence of files for a certain relation within the directory set aside for that tablespace/database combination; or (2) that value plus the OIDs of the tablespace and database; or occasionally (3) the whole series of files created for a relation based on those values. Using the same name for more than one thing is confusing. Replace RelFileNode with RelFileNumber when we're talking about just the single number, i.e. (1) from above, and with RelFileLocator when we're talking about all the things that are needed to locate a relation's files on disk, i.e. (2) from above. In the places where we refer to (3) as a relfilenode, instead refer to "relation storage". Since there is a ton of SQL code in the world that knows about pg_class.relfilenode, don't change the name of that column, or of other SQL-facing things that derive their name from it. On the other hand, do adjust closely-related internal terminology. For example, the structure member names dbNode and spcNode appear to be derived from the fact that the structure itself was called RelFileNode, so change those to dbOid and spcOid. Likewise, various variables with names like rnode and relnode get renamed appropriately, according to how they're being used in context. Hopefully, this is clearer than before. It is also preparation for future patches that intend to widen the relfilenumber fields from its current width of 32 bits. Variables that store a relfilenumber are now declared as type RelFileNumber rather than type Oid; right now, these are the same, but that can now more easily be changed. Dilip Kumar, per an idea from me. Reviewed also by Andres Freund. I fixed some whitespace issues, changed a couple of words in a comment, and made one other minor correction. Discussion: http://postgr.es/m/CA+TgmoamOtXbVAQf9hWFzonUo6bhhjS6toZQd7HZ-pmojtAmag@mail.gmail.com Discussion: http://postgr.es/m/CA+Tgmobp7+7kmi4gkq7Y+4AM9fTvL+O1oQ4-5gFTT+6Ng-dQ=g@mail.gmail.com Discussion: http://postgr.es/m/CAFiTN-vTe79M8uDH1yprOU64MNFE+R3ODRuA+JWf27JbhY4hJw@mail.gmail.com
* Refactor sending of DataRow messages in replication protocolPeter Eisentraut2022-07-06
| | | | | | | | | | | | Some routines open-coded the construction of DataRow messages. Use TupOutputState struct and associated functions instead, which was already done in some places. SendTimeLineHistory() is a bit more complicated and isn't converted by this. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/7e4fdbdc-699c-4cd0-115d-fb78a957fc22@enterprisedb.com
* Replace durable_rename_excl() by durable_rename(), take twoMichael Paquier2022-07-05
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | durable_rename_excl() attempts to avoid overwriting any existing files by using link() and unlink(), and it falls back to rename() on some platforms (aka WIN32), which offers no such overwrite protection. Most callers use durable_rename_excl() just in case there is an existing file, but in practice there shouldn't be one (see below for more details). Furthermore, failures during durable_rename_excl() can result in multiple hard links to the same file. As per Nathan's tests, it is possible to end up with two links to the same file in pg_wal after a crash just before unlink() during WAL recycling. Specifically, the test produced links to the same file for the current WAL file and the next one because the half-recycled WAL file was re-recycled upon restarting, leading to WAL corruption. This change replaces all the calls of durable_rename_excl() to durable_rename(). This removes the protection against accidentally overwriting an existing file, but some platforms are already living without it and ordinarily there shouldn't be one. The function itself is left around in case any extensions are using it. It will be removed on HEAD via a follow-up commit. Here is a summary of the existing callers of durable_rename_excl() (see second discussion link at the bottom), replaced by this commit. First, basic_archive used it to avoid overwriting an archive concurrently created by another server, but as mentioned above, it will still overwrite files on some platforms. Second, xlog.c uses it to recycle past WAL segments, where an overwrite should not happen (origin of the change at f0e37a8) because there are protections about the WAL segment to select when recycling an entry. The third and last area is related to the write of timeline history files. writeTimeLineHistory() will write a new timeline history file at the end of recovery on promotion, so there should be no such files for the same timeline. What remains is writeTimeLineHistoryFile(), that can be used in parallel by a WAL receiver and the startup process, and some digging of the buildfarm shows that EEXIST from a WAL receiver can happen with an error of "could not link file \"pg_wal/xlogtemp.NN\" to \"pg_wal/MM.history\", which would cause an automatic restart of the WAL receiver as it is promoted to FATAL, hence this should improve the stability of the WAL receiver as rename() would overwrite an existing TLI history file already fetched by the startup process at recovery. This is a bug fix, but knowing the unlikeliness of the problem involving one or more crashes at an exceptionally bad moment, no backpatch is done. Also, I want to be careful with such changes (aaa3aed did the opposite of this change by removing HAVE_WORKING_LINK so as Windows would do a link() rather than a rename() but this was not concurrent-safe). A backpatch could be revisited in the future. This is the second time this change is attempted, ccfbd92 being the first one, but this time no assertions are added for the case of a TLI history file written concurrently by the WAL receiver or the startup process because we can expect one to exist (some of the TAP tests are able to trigger with a proper timing). Author: Nathan Bossart Reviewed-by: Robert Haas, Kyotaro Horiguchi, Michael Paquier Discussion: https://postgr.es/m/20220407182954.GA1231544@nathanxps13 Discussion: https://postgr.es/m/Ym6GZbqQdlalSKSG@paquier.xyz
* Refactor sending of RowDescription messages in replication protocolPeter Eisentraut2022-07-04
| | | | | | | | | Some routines open-coded the construction of RowDescription messages. Instead, we have support for doing this using tuple descriptors and DestRemoteSimple, so use that instead. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/7e4fdbdc-699c-4cd0-115d-fb78a957fc22@enterprisedb.com
* Add construct_array_builtin, deconstruct_array_builtinPeter Eisentraut2022-07-01
| | | | | | | | | | | | | | | There were many calls to construct_array() and deconstruct_array() for built-in types, for example, when dealing with system catalog columns. These all hardcoded the type attributes necessary to pass to these functions. To simplify this a bit, add construct_array_builtin(), deconstruct_array_builtin() as wrappers that centralize this hardcoded knowledge. This simplifies many call sites and reduces the amount of hardcoded stuff that is spread around. Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> Discussion: https://www.postgresql.org/message-id/flat/2914356f-9e5f-8c59-2995-5997fc48bcba%40enterprisedb.com
* Fix code comments still referring to pg_start/stop_backup()Michael Paquier2022-07-01
| | | | | | | | | pg_start_backup() and pg_stop_backup() have been respectively renamed to pg_backup_start() and pg_backup_stop() as of 39969e2, but a few comments did not get the call. Reviewed-by: Kyotaro Horiguchi, David Steele Discussion: https://postgr.es/m/YrqGlj1+4DF3dbZ/@paquier.xyz
* Change some unnecessary MemSet callsPeter Eisentraut2022-07-01
| | | | | | | | | | MemSet() with a value other than 0 just falls back to memset(), so the indirection is unnecessary if the value is constant and not 0. Since there is some interest in getting rid of MemSet(), this gets some easy cases out of the way. (There are a few MemSet() calls that I didn't change to maintain the consistency with their surrounding code.) Discussion: https://www.postgresql.org/message-id/flat/CAEudQApCeq4JjW1BdnwU=m=-DvG5WyUik0Yfn3p6UNphiHjj+w@mail.gmail.com
* pgindent run prior to branching v15.Tom Lane2022-06-30
| | | | pgperltidy and reformat-dat-files too. Not many changes.
* Fix visibility check when XID is committed in CLOG but not in procarray.Heikki Linnakangas2022-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TransactionIdIsInProgress had a fast path to return 'false' if the single-item CLOG cache said that the transaction was known to be committed. However, that was wrong, because a transaction is first marked as committed in the CLOG but doesn't become visible to others until it has removed its XID from the proc array. That could lead to an error: ERROR: t_xmin is uncommitted in tuple to be updated or for an UPDATE to go ahead without blocking, before the previous UPDATE on the same row was made visible. The window is usually very short, but synchronous replication makes it much wider, because the wait for synchronous replica happens in that window. Another thing that makes it hard to hit is that it's hard to get such a commit-in-progress transaction into the single item CLOG cache. Normally, if you call TransactionIdIsInProgress on such a transaction, it determines that the XID is in progress without checking the CLOG and without populating the cache. One way to prime the cache is to explicitly call pg_xact_status() on the XID. Another way is to use a lot of subtransactions, so that the subxid cache in the proc array is overflown, making TransactionIdIsInProgress rely on pg_subtrans and CLOG checks. This has been broken ever since it was introduced in 2008, but the race condition is very hard to hit, especially without synchronous replication. There were a couple of reports of the error starting from summer 2021, but no one was able to find the root cause then. TransactionIdIsKnownCompleted() is now unused. In 'master', remove it, but I left it in place in backbranches in case it's used by extensions. Also change pg_xact_status() to check TransactionIdIsInProgress(). Previously, it only checked the CLOG, and returned "committed" before the transaction was actually made visible to other queries. Note that this also means that you cannot use pg_xact_status() to reproduce the bug anymore, even if the code wasn't fixed. Report and analysis by Konstantin Knizhnik. Patch by Simon Riggs, with the pg_xact_status() change added by me. Author: Simon Riggs Reviewed-by: Andres Freund Discussion: https://www.postgresql.org/message-id/flat/4da7913d-398c-e2ad-d777-f752cf7f0bbb%40garret.ru