aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/transam
Commit message (Collapse)AuthorAge
...
* Add more LOG messages when starting and ending recovery from a backupMichael Paquier2024-01-25
| | | | | | | | | | | | | | Three LOG messages are added in the recovery code paths, providing information that can be useful to track corruption issues depending on the state of the cluster, telling that: - Recovery has started from a backup_label. - Recovery is restarting from a backup start LSN, without a backup_label. - Recovery has completed from a backup. Author: Andres Freund Reviewed-by: David Steele, Laurenz Albe, Michael Paquier Discussion: https://postgr.es/m/20231117041811.vz4vgkthwjnwp2pp@awork3.anarazel.de
* Update copyright for 2024Bruce Momjian2024-01-03
| | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz Backpatch-through: 12
* Fix incorrect data type choices in some read and write calls.Tom Lane2023-12-27
| | | | | | | | | | | | | | | | | | | | | | | | Recently-introduced code in reconstruct.c was using "unsigned" to store the result of read(), pg_pread(), or write(). This is completely bogus: it breaks subsequent tests for the result being negative, as we're being reminded of by a chorus of buildfarm warnings. Switch to "int" as was doubtless intended. (There are several other uses of "unsigned" in this file that also look poorly chosen to me, but for now I'm just trying to clean up the buildfarm.) A larger problem is that "int" is not necessarily wide enough to hold the result: per POSIX, all these functions return ssize_t. In places where the requested read or write length clearly fits in int, that's academic. It may be academic anyway as long as we constrain individual data files to 1GB, since even a readv or writev-like operation would then not be responsible for transferring more than 1GB. Nonetheless it seems like trouble waiting to happen, so I made a pass over readv and writev calls and fixed the result variables where that seemed appropriate. We might want to think about changing some of the fd.c functions to return ssize_t too, for future-proofing; but I didn't tackle that here. Discussion: https://postgr.es/m/1672202.1703441340@sss.pgh.pa.us
* Add support for incremental backup.Robert Haas2023-12-20
| | | | | | | | | | | | | | | | | | | | | | To take an incremental backup, you use the new replication command UPLOAD_MANIFEST to upload the manifest for the prior backup. This prior backup could either be a full backup or another incremental backup. You then use BASE_BACKUP with the INCREMENTAL option to take the backup. pg_basebackup now has an --incremental=PATH_TO_MANIFEST option to trigger this behavior. An incremental backup is like a regular full backup except that some relation files are replaced with files with names like INCREMENTAL.${ORIGINAL_NAME}, and the backup_label file contains additional lines identifying it as an incremental backup. The new pg_combinebackup tool can be used to reconstruct a data directory from a full backup and a series of incremental backups. Patch by me. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut, and Álvaro Herrera. Thanks especially to Jakub for incredibly helpful and extensive testing. Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
* Add a new WAL summarizer process.Robert Haas2023-12-20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | When active, this process writes WAL summary files to $PGDATA/pg_wal/summaries. Each summary file contains information for a certain range of LSNs on a certain TLI. For each relation, it stores a "limit block" which is 0 if a relation is created or destroyed within a certain range of WAL records, or otherwise the shortest length to which the relation was truncated during that range of WAL records, or otherwise InvalidBlockNumber. In addition, it stores a list of blocks which have been modified during that range of WAL records, but excluding blocks which were removed by truncation after they were modified and never subsequently modified again. In other words, it tells us which blocks need to copied in case of an incremental backup covering that range of WAL records. But this doesn't yet add the capability to actually perform an incremental backup; the next patch will do that. A new parameter summarize_wal enables or disables this new background process. The background process also automatically deletes summary files that are older than wal_summarize_keep_time, if that parameter has a non-zero value and the summarizer is configured to run. Patch by me, with some design help from Dilip Kumar and Andres Freund. Reviewed by Matthias van de Meent, Dilip Kumar, Jakub Wartak, Peter Eisentraut, and Álvaro Herrera. Discussion: http://postgr.es/m/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
* Additional write barrier in AdvanceXLInsertBuffer().Jeff Davis2023-12-19
| | | | | | | | | | | | First, mark the xlblocks member with InvalidXLogRecPtr, then issue a write barrier, then initialize it. That ensures that the xlblocks member doesn't appear valid while the contents are being initialized. In preparation for reading WAL buffer contents without a lock. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACVfFMfqD5oLzZSQQZWfXiJqd-NdX0_317veP6FuB31QWA@mail.gmail.com Reviewed-by: Andres Freund
* Use 64-bit atomics for xlblocks array elements.Jeff Davis2023-12-19
| | | | | | | | | | In preparation for reading the contents of WAL buffers without a lock. Also, avoids the previously-needed comment in GetXLogBuffer() explaining why it's safe from torn reads. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACVfFMfqD5oLzZSQQZWfXiJqd-NdX0_317veP6FuB31QWA@mail.gmail.com Reviewed-by: Andres Freund
* Prevent tuples to be marked as dead in subtransactions on standbysMichael Paquier2023-12-12
| | | | | | | | | | | | | | | | | | | | | | | Dead tuples are ignored and are not marked as dead during recovery, as it can lead to MVCC issues on a standby because its xmin may not match with the primary. This information is tracked by a field called "xactStartedInRecovery" in the transaction state data, switched on when starting a transaction in recovery. Unfortunately, this information was not correctly tracked when starting a subtransaction, because the transaction state used for the subtransaction did not update "xactStartedInRecovery" based on the state of its parent. This would cause index scans done in subtransactions to return inconsistent data, depending on how the xmin of the primary and/or the standby evolved. This is broken since the introduction of hot standby in efc16ea52067, so backpatch all the way down. Author: Fei Changhong Reviewed-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/tencent_C4D907A5093C071A029712E73B43C6512706@qq.com Backpatch-through: 12
* Remove trace_recovery_messagesMichael Paquier2023-12-11
| | | | | | | | | | | | This GUC was intended as a debugging help in the 9.0 area when hot standby and streaming replication were being developped, able to offer more information at LOG level rather than DEBUGn. There are more tools available these days that are able to offer rather equivalent information, like pg_waldump introduced in 9.3. It is not obvious how this facility is useful these days, so let's remove it. Author: Bharath Rupireddy Discussion: https://postgr.es/m/ZXEXEAUVFrvpquSd@paquier.xyz
* Allow parallel CREATE INDEX for BRIN indexesTomas Vondra2023-12-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow using multiple worker processes to build BRIN index, which until now was supported only for BTREE indexes. For large tables this often results in significant speedup when the build is CPU-bound. The work is split in a simple way - each worker builds BRIN summaries on a subset of the table, determined by the regular parallel scan used to read the data, and feeds them into a shared tuplesort which sorts them by blkno (start of the range). The leader then reads this sorted stream of ranges, merges duplicates (which may happen if the parallel scan does not align with BRIN pages_per_range), and adds the resulting ranges into the index. The number of duplicate results produced by workers (requiring merging in the leader process) should be fairly small, thanks to how parallel scans assign chunks to workers. The likelihood of duplicate results may increase for higher pages_per_range values, but then there are fewer page ranges in total. In any case, we expect the merging to be much cheaper than summarization, so this should be a win. Most of the parallelism infrastructure is a simplified copy of the code used by BTREE indexes, omitting the parts irrelevant for BRIN indexes (e.g. uniqueness checks). This also introduces a new index AM flag amcanbuildparallel, determining whether to attempt to start parallel workers for the index build. Original patch by me, with reviews and substantial reworks by Matthias van de Meent, certainly enough to make him a co-author. Author: Tomas Vondra, Matthias van de Meent Reviewed-by: Matthias van de Meent Discussion: https://postgr.es/m/c2ee7d69-ce17-43f2-d1a0-9811edbda6e6%40enterprisedb.com
* Rename ShmemVariableCache to TransamVariablesHeikki Linnakangas2023-12-08
| | | | | | | | The old name was misleading: It's not a cache, the values kept in the struct are the authoritative source. Reviewed-by: Tristan Partin, Richard Guo Discussion: https://www.postgresql.org/message-id/6537d63d-4bb5-46f8-9b5d-73a8ba4720ab@iki.fi
* Initialize ShmemVariableCache like other shmem areasHeikki Linnakangas2023-12-08
| | | | | | | For sake of consistency. Reviewed-by: Tristan Partin, Richard Guo Discussion: https://www.postgresql.org/message-id/6537d63d-4bb5-46f8-9b5d-73a8ba4720ab@iki.fi
* Fix potential pointer overflow in xlogreader.c.Thomas Munro2023-12-08
| | | | | | | | | | | | | | | | | | | While checking if a record could fit in the circular WAL decoding buffer, the coding from commit 3f1ce973 used arithmetic that could overflow. 64 bit systems were unaffected for various technical reasons, which probably explains the lack of problem reports. Likewise for 32 bit systems running known 32 bit kernels. The systems at risk of problems appear to be 32 bit processes running on 64 bit kernels, with unlucky placement in memory. Per complaint from GCC -fsanitize=undefined -m32, while testing variations of 039_end_of_wal.pl. Back-patch to 15. Reviewed-by: Nathan Bossart <nathandbossart@gmail.com> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGKH0oRPOX7DhiQ_b51sM8HqcPp2J3WA-Oen%3DdXog%2BAGGQ%40mail.gmail.com
* Fix compilation on Windows with WAL_DEBUGMichael Paquier2023-12-06
| | | | | | | | | | This has been broken since b060dbe0001a that has reworked the callback mechanism of XLogReader, most likely unnoticed because any form of development involving WAL happens on platforms where this compiles fine. Author: Bharath Rupireddy Discussion: https://postgr.es/m/CALj2ACVF14WKQMFwcJ=3okVDhiXpuK5f7YdT+BdYXbbypMHqWA@mail.gmail.com Backpatch-through: 13
* Fix typo in 5a1dfde8334bAlexander Korotkov2023-11-30
| | | | | Reported-by: Alexander Lakhin Discussion: https://postgr.es/m/55d8800f-4a80-5256-1e84-246fbe79acd0@gmail.com
* Fix warning due non-standard inline declaration in 4ed8f0913bfdb5f355Alexander Korotkov2023-11-30
| | | | | | Reported-by: Alexander Lakhin, Tom Lane Author: Pavel Borisov Discussion: https://postgr.es/m/55d8800f-4a80-5256-1e84-246fbe79acd0@gmail.com
* Apply quotes more consistently to GUC names in logsMichael Paquier2023-11-30
| | | | | | | | | | | | | | Quotes are applied to GUCs in a very inconsistent way across the code base, with a mix of double quotes or no quotes used. This commit removes double quotes around all the GUC names that are obviously referred to as parameters with non-English words (use of underscore, mixed case, etc). This is the result of a discussion with Álvaro Herrera, Nathan Bossart, Laurenz Albe, Peter Eisentraut, Tom Lane and Daniel Gustafsson. Author: Peter Smith Discussion: https://postgr.es/m/CAHut+Pv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w@mail.gmail.com
* Make use FullTransactionId in 2PC filenamesAlexander Korotkov2023-11-29
| | | | | | | | | | | | | | Switch from using TransactionId to FullTransactionId in naming of 2PC files. Transaction state file in the pg_twophase directory now have extra 8 bytes in the name to address an epoch of a given xid. Author: Maxim Orlov, Aleksander Alekseev, Alexander Korotkov, Teodor Sigaev Author: Nikita Glukhov, Pavel Borisov, Yura Sokolov Reviewed-by: Jacob Champion, Heikki Linnakangas, Alexander Korotkov Reviewed-by: Japin Li, Pavel Borisov, Tom Lane, Peter Eisentraut, Andres Freund Reviewed-by: Andrey Borodin, Dilip Kumar, Aleksander Alekseev Discussion: https://postgr.es/m/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com Discussion: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq%2BvfkmTF5Q%40mail.gmail.com
* Index SLRUs by 64-bit integers rather than by 32-bit integersAlexander Korotkov2023-11-29
| | | | | | | | | | | | | | | | | | We've had repeated bugs in the area of handling SLRU wraparound in the past, some of which have caused data loss. Switching to an indexing system for SLRUs that does not wrap around should allow us to get rid of a whole bunch of problems and improve the overall reliability of the system. This particular patch however only changes the indexing and doesn't address the wraparound per se. This is going to be done in the following patches. Author: Maxim Orlov, Aleksander Alekseev, Alexander Korotkov, Teodor Sigaev Author: Nikita Glukhov, Pavel Borisov, Yura Sokolov Reviewed-by: Jacob Champion, Heikki Linnakangas, Alexander Korotkov Reviewed-by: Japin Li, Pavel Borisov, Tom Lane, Peter Eisentraut, Andres Freund Reviewed-by: Andrey Borodin, Dilip Kumar, Aleksander Alekseev Discussion: https://postgr.es/m/CACG%3DezZe1NQSCnfHOr78AtAZxJZeCvxrts0ygrxYwe%3DpyyjVWA%40mail.gmail.com Discussion: https://postgr.es/m/CAJ7c6TPDOYBYrnCAeyndkBktO0WG2xSdYduTF0nxq%2BvfkmTF5Q%40mail.gmail.com
* Reduce rate of walwriter wakeups due to async commits.Heikki Linnakangas2023-11-27
| | | | | | | | | | | | | | | | | XLogSetAsyncXactLSN(), called at asynchronous commit, would wake up walwriter every time the LSN advances, but walwriter doesn't actually do anything unless it has at least 'wal_writer_flush_after' full blocks of WAL to write. Repeatedly waking up walwriter to do nothing is a waste of CPU cycles in both walwriter and the backends doing the wakeups. To fix, apply the same logic in XLogSetAsyncXactLSN() to decide whether to wake up walwriter, as walwriter uses to determine if it has any work to do. In the passing, rename misleadingly named 'flushbytes' local variable to 'flushblocks'. Author: Andres Freund, Heikki Linnakangas Discussion: https://www.postgresql.org/message-id/20231024230929.vsc342baqs7kmbte@awork3.anarazel.de
* C comment: clarify that WAL files can be _recycled_ or removedBruce Momjian2023-11-25
| | | | | | | | | | Reported-by: Michael Paquier Discussion: https://postgr.es/m/CAB7nPqSDdF0heotQU3gsepgqx+9c+6KjLd3R6aNYH7KKfDd2ig@mail.gmail.com Author: Michael Paquier Backpatch-through: master
* modify segno. for pg_walfile_name() and pg_walfile_name_offset()Bruce Momjian2023-11-24
| | | | | | | | | | | | | | | | | | | | | Previously these functions returned the previous segment number if the LSN was on a segment boundary. We now always return the current segment number for an LSN. Docs updated to reflect this change. Regression tests added, author Andres Freund. Also mentioned in thread https://postgr.es/m/flat/20220204225057.GA1535307%40nathanxps13#d964275c9540d8395e138efc0a75f7e8 BACKWARD INCOMPATIBILITY Reported-by: Kyotaro Horiguchi Discussion: https://postgr.es/m/20190726.172120.101752680.horikyota.ntt@gmail.com Co-authored-by: Kyotaro Horiguchi Backpatch-through: master
* Retire MemoryContextResetAndDeleteChildren() macro.Nathan Bossart2023-11-15
| | | | | | | | | | | | | | | | | As of commit eaa5808e8e, MemoryContextResetAndDeleteChildren() is just a backwards compatibility macro for MemoryContextReset(). Now that some time has passed, this macro seems more likely to create confusion. This commit removes the macro and replaces all remaining uses with calls to MemoryContextReset(). Any third-party code that use this macro will need to be adjusted to call MemoryContextReset() instead. Since the two have behaved the same way since v9.5, such adjustments won't produce any behavior changes for all currently-supported versions of PostgreSQL. Reviewed-by: Amul Sul, Tom Lane, Alvaro Herrera, Dagfinn Ilmari Mannsåker Discussion: https://postgr.es/m/20231113185950.GA1668018%40nathanxps13
* Clear CurrentResourceOwner earlier in CommitTransaction.Heikki Linnakangas2023-11-15
| | | | | | | | | | | Alexander reported a crash with repeated create + drop database, after the ResourceOwner rewrite (commit b8bff07daa). That was fixed by the previous commit, but it nevertheless seems like a good idea clear CurrentResourceOwner earlier, because you're not supposed to use it for anything after we start releasing it. Reviewed-by: Alexander Lakhin Discussion: https://www.postgresql.org/message-id/11b70743-c5f3-3910-8e5b-dd6c115ff829%40gmail.com
* Prohibit max_slot_wal_keep_size to value other than -1 during upgrade.Amit Kapila2023-11-10
| | | | | | | | | | | We don't want existing slots in the old cluster to get invalidated during the upgrade. During an upgrade, we set this variable to -1 via the command line in an attempt to prevent such invalidations, but users have ways to override it. This patch ensures the value is not overridden by the user. Author: Kyotaro Horiguchi Reviewed-by: Peter Smith, Alvaro Herrera Discussion: http://postgr.es/m/20231027.115759.2206827438943188717.horikyota.ntt@gmail.com
* Make ResourceOwners more easily extensible.Heikki Linnakangas2023-11-08
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of having a separate array/hash for each resource kind, use a single array and hash to hold all kinds of resources. This makes it possible to introduce new resource "kinds" without having to modify the ResourceOwnerData struct. In particular, this makes it possible for extensions to register custom resource kinds. The old approach was to have a small array of resources of each kind, and if it fills up, switch to a hash table. The new approach also uses an array and a hash, but now the array and the hash are used at the same time. The array is used to hold the recently added resources, and when it fills up, they are moved to the hash. This keeps the access to recent entries fast, even when there are a lot of long-held resources. All the resource-specific ResourceOwnerEnlarge*(), ResourceOwnerRemember*(), and ResourceOwnerForget*() functions have been replaced with three generic functions that take resource kind as argument. For convenience, we still define resource-specific wrapper macros around the generic functions with the old names, but they are now defined in the source files that use those resource kinds. The release callback no longer needs to call ResourceOwnerForget on the resource being released. ResourceOwnerRelease unregisters the resource from the owner before calling the callback. That needed some changes in bufmgr.c and some other files, where releasing the resources previously always called ResourceOwnerForget. Each resource kind specifies a release priority, and ResourceOwnerReleaseAll releases the resources in priority order. To make that possible, we have to restrict what you can do between phases. After calling ResourceOwnerRelease(), you are no longer allowed to remember any more resources in it or to forget any previously remembered resources by calling ResourceOwnerForget. There was one case where that was done previously. At subtransaction commit, AtEOSubXact_Inval() would handle the invalidation messages and call RelationFlushRelation(), which temporarily increased the reference count on the relation being flushed. We now switch to the parent subtransaction's resource owner before calling AtEOSubXact_Inval(), so that there is a valid ResourceOwner to temporarily hold that relcache reference. Other end-of-xact routines make similar calls to AtEOXact_Inval() between release phases, but I didn't see any regression test failures from those, so I'm not sure if they could reach a codepath that needs remembering extra resources. There were two exceptions to how the resource leak WARNINGs on commit were printed previously: llvmjit silently released the context without printing the warning, and a leaked buffer io triggered a PANIC. Now everything prints a WARNING, including those cases. Add tests in src/test/modules/test_resowner. Reviewed-by: Aleksander Alekseev, Michael Paquier, Julien Rouhaud Reviewed-by: Kyotaro Horiguchi, Hayato Kuroda, Álvaro Herrera, Zhihong Yu Reviewed-by: Peter Eisentraut, Andres Freund Discussion: https://www.postgresql.org/message-id/cbfabeb0-cd3c-e951-a572-19b365ed314d%40iki.fi
* Delay recovery mode LOG after reading backup_label and/or checkpoint recordMichael Paquier2023-10-30
| | | | | | | | | | | | | | | | | | | | | | When beginning recovery, a LOG is displayed by the startup process to show which recovery mode will be used depending on the .signal file(s) set in the data folder, like "standby mode", recovery up to a given target type and value, or archive recovery. A different patch is under discussion to simplify the startup code by requiring the presence of recovery.signal and/or standby.signal when a backup_label file is read. Delaying a bit this LOG ensures that the correct recovery mode would be reported, and putting it at this position does not make it lose its value. While on it, this commit adds a few comments documenting a bit more the initial recovery steps and their dependencies, and fixes an incorrect comment format. This introduces no behavior changes. Extracted from a larger patch by me. Reviewed-by: David Steele, Bowen Shi Discussion: https://postgr.es/m/ZArVOMifjzE7f8W7@paquier.xyz
* Mention standby.signal in FATALs for checkpoint record missing at recoveryMichael Paquier2023-10-30
| | | | | | | | | | | | | | | When beginning recovery from a base backup by reading a backup_label file, it may be possible that no checkpoint record is available depending on the method used when the case backup was taken, which would prevent recovery from beginning. In this case, the FATAL messages issued, initially added by c900c15269f0f, mentioned recovery.signal as an option to do recovery but not standby.signal. Let's add it as an available option, for clarity. Per suggestion from Bowen Shi, extracted from a larger patch by me. Author: Michael Paquier Discussion: https://postgr.es/m/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8=2aw@mail.gmail.com
* Introduce pg_stat_checkpointerMichael Paquier2023-10-30
| | | | | | | | | | | | | | | | | | | | | | | | Historically, the statistics of the checkpointer have been always part of pg_stat_bgwriter. This commit removes a few columns from pg_stat_bgwriter, and introduces pg_stat_checkpointer with equivalent, renamed columns (plus a new one for the reset timestamp): - checkpoints_timed -> num_timed - checkpoints_req -> num_requested - checkpoint_write_time -> write_time - checkpoint_sync_time -> sync_time - buffers_checkpoint -> buffers_written The fields of PgStat_CheckpointerStats and its SQL functions are renamed to match with the new field names, for consistency. Note that background writer and checkpointer have been split into two different processes in commits 806a2aee3791 and bf405ba8e460. The pgstat structures were already split, making this change straight-forward. Bump catalog version. Author: Bharath Rupireddy Reviewed-by: Bertrand Drouvot, Andres Freund, Michael Paquier Discussion: https://postgr.es/m/CALj2ACVxX2ii=66RypXRweZe2EsBRiPMj0aHfRfHUeXJcC7kHg@mail.gmail.com
* Add trailing commas to enum definitionsPeter Eisentraut2023-10-26
| | | | | | | | | | | | | | | | | | | | Since C99, there can be a trailing comma after the last value in an enum definition. A lot of new code has been introducing this style on the fly. Some new patches are now taking an inconsistent approach to this. Some add the last comma on the fly if they add a new last value, some are trying to preserve the existing style in each place, some are even dropping the last comma if there was one. We could nudge this all in a consistent direction if we just add the trailing commas everywhere once. I omitted a few places where there was a fixed "last" value that will always stay last. I also skipped the header files of libpq and ecpg, in case people want to use those with older compilers. There were also a small number of cases where the enum type wasn't used anywhere (but the enum values were), which ended up confusing pgindent a bit, so I left those alone. Discussion: https://www.postgresql.org/message-id/flat/386f8c45-c8ac-4681-8add-e3b0852c1620%40eisentraut.org
* Assert that buffers are marked dirty before XLogRegisterBuffer().Jeff Davis2023-10-23
| | | | | | | | | | | Enforce the rule from transam/README in XLogRegisterBuffer(), and update callers to follow the rule. Hash indexes sometimes register clean pages as a part of the locking protocol, so provide a REGBUF_NO_CHANGE flag to support that use. Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a904@iki.fi Reviewed-by: Heikki Linnakangas
* Change struct tablespaceinfo's oid member from 'char *' to 'Oid'Robert Haas2023-10-23
| | | | | | | | | | | | | | | | | | | | | | | | | | This shouldn't change behavior except in the unusual case where there are file in the tablespace directory that have entirely numeric names but are nevertheless not possible names for a tablespace directory, either because their names have leading zeroes that shouldn't be there, or the value is actually zero, or because the value is too large to represent as an OID. In those cases, the directory would previously have made it into the list of tablespaceinfo objects and no longer will. Thus, base backups will now ignore such directories, instead of treating them as legitimate tablespace directories. Similarly, if entries for such tablespaces occur in a tablespace_map file, they will now be rejected as erroneous, instead of being honored. This is infrastructure for future work that wants to be able to know the tablespace of each relation that is part of a backup *as an OID*. By strengthening the up-front validation, we don't have to worry about weird cases later, and can more easily avoid repeated string->integer conversions. Patch by me, reviewed by David Steele. Discussion: http://postgr.es/m/CA+TgmoZNVeBzoqDL8xvr-nkaepq815jtDR4nJzPew7=3iEuM1g@mail.gmail.com
* During online checkpoints, insert XLOG_CHECKPOINT_REDO at redo point.Robert Haas2023-10-19
| | | | | | | | | | | | | | | | | | | | | | | | | | This allows tools that read the WAL sequentially to identify (possible) redo points when they're reached, rather than only being able to detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly much later in the WAL stream. There are other possible applications as well; see the discussion links below. Any redo location that precedes the checkpoint location should now point to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this. While adjusting the code in CreateCheckPoint() for this patch, I made it call WALInsertLockAcquireExclusive a bit later than before, since there appears to be no need for it to be held while checking whether the system is idle, whether this is an end-of-recovery checkpoint, or what the current timeline is. Bump XLOG_PAGE_MAGIC. Patch by me, based in part on earlier work from Dilip Kumar. Review by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com Discussion: http://postgr.es/m/20230614194717.jyuw3okxup4cvtbt%40awork3.anarazel.de Discussion: http://postgr.es/m/CA+hUKG+b2ego8=YNW2Ohe9QmSiReh1-ogrv8V_WZpJTqP3O+2w@mail.gmail.com
* Reword messages about impending (M)XID exhaustion.Robert Haas2023-10-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | First, we shouldn't recommend switching to single-user mode, because that's terrible advice. Especially on newer versions where VACUUM will enter emergency mode when nearing (M)XID exhaustion, it's perfectly fine to just VACUUM in multi-user mode. Doing it that way is less disruptive and avoids disabling the safeguards that prevent actual wraparound, so recommend that instead. Second, be more precise about what is going to happen (when we're nearing the limits) or what is happening (when we actually hit them). The database doesn't shut down, nor does it refuse all commands. It refuses commands that assign whichever of XIDs and MXIDs are nearly exhausted. No back-patch. The existing hint that advises going to single-user mode is sufficiently awful advice that removing it or changing it might be justifiable even though we normally avoid changing user-facing messages in back-branches, but I (rhaas) felt that it was better to be more conservative and limit this fix to master only. Aside from the usual risk of breaking translations, people might be used to the existing message, or even have monitoring scripts that look for it. Alexander Alekseev, John Naylor, Robert Haas, reviewed at various times by Peter Geoghegan, Hannu Krosing, and Andres Freund. Discussion: http://postgr.es/m/CA+TgmoZBg95FiR9wVQPAXpGPRkacSt2okVge+PKPPFppN7sfnQ@mail.gmail.com
* Talk about assigning, rather than generating, new MultiXactIds.Robert Haas2023-10-17
| | | | | | | | | The word "assign" is used in various places internally to describe what GetNewMultiXactId does, but the user-facing messages have previously said "generate". For consistency, standardize on "assign," which seems (at least to me) to be slightly clearer. Discussion: http://postgr.es/m/CA+TgmoaoE1_i3=4-7GCTtKLVZVQ2Gh6qESW2VG1OprtycxOHMA@mail.gmail.com
* Move extra code out of the Pre/PostRestoreCommand() section.Nathan Bossart2023-10-16
| | | | | | | | | | | | | | | | | | | | If SIGTERM is received within this section, the startup process will immediately proc_exit() in the signal handler, so it is inadvisable to include any more code than is required there (as such code is unlikely to be compatible with doing proc_exit() in a signal handler). This commit moves the code recently added to this section (see 1b06d7bac9 and 7fed801135) to outside of the section. This ensures that the startup process only calls proc_exit() in its SIGTERM handler for the duration of the system() call, which is how this code worked from v8.4 to v14. Reported-by: Michael Paquier, Thomas Munro Analyzed-by: Andres Freund Suggested-by: Tom Lane Reviewed-by: Michael Paquier, Robert Haas, Thomas Munro, Andres Freund Discussion: https://postgr.es/m/Y9nGDSgIm83FHcad%40paquier.xyz Discussion: https://postgr.es/m/20230223231503.GA743455%40nathanxps13 Backpatch-through: 15
* Improve the naming in wal_sync_method code.Nathan Bossart2023-10-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | * sync_method is renamed to wal_sync_method. * sync_method_options[] is renamed to wal_sync_method_options[]. * assign_xlog_sync_method() is renamed to assign_wal_sync_method(). * The names of the available synchronization methods are now prefixed with "WAL_SYNC_METHOD_" and have been moved into a WalSyncMethod enum. * PLATFORM_DEFAULT_SYNC_METHOD is renamed to PLATFORM_DEFAULT_WAL_SYNC_METHOD, and DEFAULT_SYNC_METHOD is renamed to DEFAULT_WAL_SYNC_METHOD. These more descriptive names help distinguish the code for wal_sync_method from the code for DataDirSyncMethod (e.g., the recovery_init_sync_method configuration parameter and the --sync-method option provided by several frontend utilities). This change also prevents name collisions between the aforementioned sets of code. Since this only improves the naming of internal identifiers, there should be no behavior change. Author: Maxim Orlov Discussion: https://postgr.es/m/CACG%3DezbL1gwE7_K7sr9uqaCGkWhmvRTcTEnm3%2BX1xsRNwbXULQ%40mail.gmail.com
* Add wait events for checkpoint delay mechanism.Thomas Munro2023-10-13
| | | | | | | | | | | When MyProc->delayChkptFlags is set to temporarily block phase transitions in a concurrent checkpoint, the checkpointer enters a sleep-poll loop to wait for the flag to be cleared. We should show that as a wait event in the pg_stat_activity view. Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/CA%2BhUKGL7Whi8iwKbzkbn_1fixH3Yy8aAPz7mfq6Hpj7FeJrKMg%40mail.gmail.com
* Unify two isLogSwitch tests in XLogInsertRecord.Robert Haas2023-10-12
| | | | | | | | | | | | | | | | | An upcoming patch wants to introduce an additional special case in this function. To keep that as cheap as possible, minimize the amount of branching that we do based on whether this is an XLOG_SWITCH record. Additionally, and also in the interest of keeping the overhead of special-case code paths as low as possible, apply likely() to the non-XLOG_SWITCH case, since only a very tiny fraction of WAL records will be XLOG_SWITCH records. Patch by me, reviewed by Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier. Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
* Reindent comment in GenericXLogFinish().Tom Lane2023-10-11
| | | | Restore pgindent cleanliness, per buildfarm member koel.
* Fix bug in GenericXLogFinish().Jeff Davis2023-10-10
| | | | | | | | Mark the buffers dirty before writing WAL. Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c094a@iki.fi Reviewed-by: Heikki Linnakangas Backpatch-through: 11
* Remove duplicate words in docs and code comments.Amit Kapila2023-10-09
| | | | | | | Additionally, add a missing "the" in a couple of places. Author: Vignesh C, Dagfinn Ilmari Mannsåker Discussion: http://postgr.es/m/CALDaNm28t+wWyPfuyqEaARS810Je=dRFkaPertaLAEJYY2cWYQ@mail.gmail.com
* Tidy-up some appendStringInfo*() usagesDavid Rowley2023-10-03
| | | | | | | | | | | | Make a few newish calls to appendStringInfo() which have no special formatting use appendStringInfoString() instead. Also, adjust usages of appendStringInfoString() which only append a string containing a single character to make use of appendStringInfoChar() instead. This makes the code marginally faster, but primarily this change is so we use the StringInfo type as it was intended to be used. Discussion: https://postgr.es/m/CAApHDvpXKQmL+r=VDNS98upqhr9yGBhv2Jw3GBFFk_wKHcB39A@mail.gmail.com
* Fail hard on out-of-memory failures in xlogreader.cMichael Paquier2023-10-03
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit changes the WAL reader routines so as a FATAL for the backend or exit(FAILURE) for the frontend is triggered if an allocation for a WAL record decode fails in walreader.c, rather than treating this case as bogus data, which would be equivalent to the end of WAL. The key is to avoid palloc_extended(MCXT_ALLOC_NO_OOM) in walreader.c, relying on plain palloc() calls. The previous behavior could make WAL replay finish too early than it should. For example, crash recovery finishing earlier may corrupt clusters because not all the WAL available locally was replayed to ensure a consistent state. Out-of-memory failures would show up randomly depending on the memory pressure on the host, but one simple case would be to generate a large record, then replay this record after downsizing a host, as Ethan Mertz originally reported. This relies on bae868caf222, as the WAL reader routines now do the memory allocation required for a record only once its header has been fully read and validated, making xl_tot_len trustable. Making the WAL reader react differently on out-of-memory or bogus record data would require ABI changes, so this is the safest choice for stable branches. Also, it is worth noting that 3f1ce973467a has been using a plain palloc() in this code for some time now. Thanks to Noah Misch and Thomas Munro for the discussion. Like the other commit, backpatch down to 12, leaving out v11 that will be EOL'd soon. The behavior of considering a failed allocation as bogus data comes originally from 0ffe11abd3a0, where the record length retrieved from its header was not entirely trustable. Reported-by: Ethan Mertz Discussion: https://postgr.es/m/ZRKKdI5-RRlta3aF@paquier.xyz Backpatch-through: 12
* Correct assertion and comments about XLogRecordMaxSize.Noah Misch2023-10-01
| | | | | | The largest allocation, of xl_tot_len+8192, is in allocate_recordbuf(). Discussion: https://postgr.es/m/20230812211327.GB2326466@rfd.leadboat.com
* Fix typo in src/backend/access/transam/README.Etsuro Fujita2023-09-28
|
* Fix edge-case for xl_tot_len broken by bae868ca.Thomas Munro2023-09-26
| | | | | | | | | | | | | | | | | bae868ca removed a check that was still needed. If you had an xl_tot_len at the end of a page that was too small for a record header, but not big enough to span onto the next page, we'd immediately perform the CRC check using a bogus large length. Because of arbitrary coding differences between the CRC implementations on different platforms, nothing very bad happened on common modern systems. On systems using the _sb8.c fallback we could segfault. Restore that check, add a new assertion and supply a test for that case. Back-patch to 12, like bae868ca. Tested-by: Tom Lane <tgl@sss.pgh.pa.us> Tested-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGLCkTT7zYjzOxuLGahBdQ%3DMcF%3Dz5ZvrjSOnW4EDhVjT-g%40mail.gmail.com
* Don't trust unvalidated xl_tot_len.Thomas Munro2023-09-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xl_tot_len comes first in a WAL record. Usually we don't trust it to be the true length until we've validated the record header. If the record header was split across two pages, previously we wouldn't do the validation until after we'd already tried to allocate enough memory to hold the record, which was bad because it might actually be garbage bytes from a recycled WAL file, so we could try to allocate a lot of memory. Release 15 made it worse. Since 70b4f82a4b5, we'd at least generate an end-of-WAL condition if the garbage 4 byte value happened to be > 1GB, but we'd still try to allocate up to 1GB of memory bogusly otherwise. That was an improvement, but unfortunately release 15 tries to allocate another object before that, so you could get a FATAL error and recovery could fail. We can fix both variants of the problem more fundamentally using pre-existing page-level validation, if we just re-order some logic. The new order of operations in the split-header case defers all memory allocation based on xl_tot_len until we've read the following page. At that point we know that its first few bytes are not recycled data, by checking its xlp_pageaddr, and that its xlp_rem_len agrees with xl_tot_len on the preceding page. That is strong evidence that xl_tot_len was truly the start of a record that was logged. This problem was most likely to occur on a standby, because walreceiver.c recycles WAL files without zeroing out trailing regions of each page. We could fix that too, but it wouldn't protect us from rare crash scenarios where the trailing zeroes don't make it to disk. With reliable xl_tot_len validation in place, the ancient policy of considering malloc failure to indicate corruption at end-of-WAL seems quite surprising, but changing that is left for later work. Also included is a new TAP test to exercise various cases of end-of-WAL detection by writing contrived data into the WAL from Perl. Back-patch to 12. We decided not to put this change into the final release of 11. Author: Thomas Munro <thomas.munro@gmail.com> Author: Michael Paquier <michael@paquier.xyz> Reported-by: Alexander Lakhin <exclusion@gmail.com> Reviewed-by: Noah Misch <noah@leadboat.com> (the idea, not the code) Reviewed-by: Michael Paquier <michael@paquier.xyz> Reviewed-by: Sergei Kornilov <sk@zsrv.org> Reviewed-by: Alexander Lakhin <exclusion@gmail.com> Discussion: https://postgr.es/m/17928-aa92416a70ff44a2%40postgresql.org
* Fix COMMIT/ROLLBACK AND CHAIN in the presence of subtransactions.Tom Lane2023-09-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In older branches, COMMIT/ROLLBACK AND CHAIN failed to propagate the current transaction's properties to the new transaction if there was any open subtransaction (unreleased savepoint). Instead, some previous transaction's properties would be restored. This is because the "if (s->chain)" check in CommitTransactionCommand examined the wrong instance of the "chain" flag and falsely concluded that it didn't need to save transaction properties. Our regression tests would have noticed this, except they used identical transaction properties for multiple tests in a row, so that the faulty behavior was not distinguishable from correct behavior. Commit 12d768e70 fixed the problem in v15 and later, but only rather accidentally, because I removed the "if (s->chain)" test to avoid a compiler warning, while not realizing that the warning was flagging a real bug. In v14 and before, remove the if-test and save transaction properties unconditionally; just as in the newer branches, that's not expensive enough to justify thinking harder. Add the comment and extra regression test to v15 and later to forestall any future recurrence, but there's no live bug in those branches. Patch by me, per bug #18118 from Liu Xiang. Back-patch to v12 where the AND CHAIN feature was added. Discussion: https://postgr.es/m/18118-4b72fcbb903aace6@postgresql.org
* Rename variable for code clarityDaniel Gustafsson2023-09-15
| | | | | | | | | | | When tracking IO timing for WAL, the duration is what we calculate based on the start and end timestamps, it's not what the variable contains. Rename the timestamp variable to end to better communicate what it contains. Original patch by Krishnakumar with additional hacking to fix another occurrence by me. Author: Krishnakumar R <kksrcv001@gmail.com> Discussion: https://postgr.es/m/CAPMWgZ9f9o8awrQpjo8oxnNQ=bMDVPx00NE0QcDzvHD_ZrdLPw@mail.gmail.com