aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlog.c
Commit message (Collapse)AuthorAge
...
* Oops, fix recoveryStopsBefore functions for regular commits.Heikki Linnakangas2014-07-29
| | | | | Pointed out by Tom Lane. Backpatch to 9.4, the code was structured differently in earlier branches and didn't have this mistake.
* Treat 2PC commit/abort the same as regular xacts in recovery.Heikki Linnakangas2014-07-29
| | | | | | | | | | | | | | | | | There were several oversights in recovery code where COMMIT/ABORT PREPARED records were ignored: * pg_last_xact_replay_timestamp() (wasn't updated for 2PC commits) * recovery_min_apply_delay (2PC commits were applied immediately) * recovery_target_xid (recovery would not stop if the XID used 2PC) The first of those was reported by Sergiy Zuban in bug #11032, analyzed by Tom Lane and Andres Freund. The bug was always there, but was masked before commit d19bd29f07aef9e508ff047d128a4046cc8bc1e2, because COMMIT PREPARED always created an extra regular transaction that was WAL-logged. Backpatch to all supported versions (older versions didn't have all the features and therefore didn't have all of the above bugs).
* Fix checkpointer crash in EXEC_BACKEND builds.Robert Haas2014-07-24
| | | | | | | | | | | | | | Nothing in the checkpointer calls InitXLOGAccess(), so WALInsertLocks never got initialized there. Without EXEC_BACKEND, it works anyway because the correct value is inherited from the postmaster, but with EXEC_BACKEND we've got a problem. The problem appears to have been introduced by commit 68a2e52bbaf98f136a96b3a0d734ca52ca440a95. To fix, move the relevant initialization steps from InitXLOGAccess() to XLOGShmemInit(), making this more parallel to what we do elsewhere. Amit Kapila
* Add missing serial commasPeter Eisentraut2014-07-15
| | | | | Also update one place where the wal_level "logical" was not added to an error message.
* Fix and enhance the assertion of no palloc's in a critical section.Heikki Linnakangas2014-06-30
| | | | | | | | | | | | The assertion failed if WAL_DEBUG or LWLOCK_STATS was enabled; fix that by using separate memory contexts for the allocations made within those code blocks. This patch introduces a mechanism for marking any memory context as allowed in a critical section. Previously ErrorContext was exempt as a special case. Instead of a blanket exception of the checkpointer process, only exempt the memory context used for the pending ops hash table.
* Have multixact be truncated by checkpoint, not vacuumAlvaro Herrera2014-06-27
| | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of truncating pg_multixact at vacuum time, do it only at checkpoint time. The reason for doing it this way is twofold: first, we want it to delete only segments that we're certain will not be required if there's a crash immediately after the removal; and second, we want to do it relatively often so that older files are not left behind if there's an untimely crash. Per my proposal in http://www.postgresql.org/message-id/20140626044519.GJ7340@eldon.alvh.no-ip.org we now execute the truncation in the checkpointer process rather than as part of vacuum. Vacuum is in only charge of maintaining in shared memory the value to which it's possible to truncate the files; that value is stored as part of checkpoints also, and so upon recovery we can reuse the same value to re-execute truncate and reset the oldest-value-still-safe-to-use to one known to remain after truncation. Per bug reported by Jeff Janes in the course of his tests involving bug #8673. While at it, update some comments that hadn't been updated since multixacts were changed. Backpatch to 9.3, where persistency of pg_multixact files was introduced by commit 0ac5ad5134f2.
* Fix bug in WAL_DEBUG.Heikki Linnakangas2014-06-23
| | | | | The record header was not copied correctly to the buffer that was passed to the rm_desc function. Broken by my rm_desc signature refactoring patch.
* Change the signature of rm_desc so that it's passed a XLogRecord.Heikki Linnakangas2014-06-14
| | | | Just feels more natural, and is more consistent with rm_redo.
* Consistency improvements for slot and decoding code.Andres Freund2014-06-12
| | | | | | | | Change the order of checks in similar functions to be the same; remove a parameter that's not needed anymore; rename a memory context and expand a couple of comments. Per review comments from Amit Kapila
* Add defenses against running with a wrong selection of LOBLKSIZE.Tom Lane2014-06-05
| | | | | | | | | | | | | | | | | | | | | It's critical that the backend's idea of LOBLKSIZE match the way data has actually been divided up in pg_largeobject. While we don't provide any direct way to adjust that value, doing so is a one-line source code change and various people have expressed interest recently in changing it. So, just as with TOAST_MAX_CHUNK_SIZE, it seems prudent to record the value in pg_control and cross-check that the backend's compiled-in setting matches the on-disk data. Also tweak the code in inv_api.c so that fetches from pg_largeobject explicitly verify that the length of the data field is not more than LOBLKSIZE. Formerly we just had Asserts() for that, which is no protection at all in production builds. In some of the call sites an overlength data value would translate directly to a security-relevant stack clobber, so it seems worth one extra runtime comparison to be sure. In the back branches, we can't change the contents of pg_control; but we can still make the extra checks in inv_api.c, which will offer some amount of protection against running with the wrong value of LOBLKSIZE.
* Consistently spell a replication slot's name as slot_name.Andres Freund2014-06-05
| | | | | | | | | | | Previously there's been a mix between 'slotname' and 'slot_name'. It's not nice to be unneccessarily inconsistent in a new feature. As a post beta1 initdb now is required in the wake of eeca4cd35e, fix the inconsistencies. Most the changes won't affect usage of replication slots because the majority of changes is around function parameter names. The prominent exception to that is that the recovery.conf parameter 'primary_slotname' is now named 'primary_slot_name'.
* Fix a bunch of functions that were declared static then defined not-static.Tom Lane2014-05-17
| | | | Per testing with a compiler that whines about this.
* Rename min_recovery_apply_delay to recovery_min_apply_delay.Tom Lane2014-05-10
| | | | | | | Per discussion, this seems like a more consistent choice of name. Fabrízio de Royes Mello, after a suggestion by Peter Eisentraut; some additional documentation wordsmithing by me
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Improve generation algorithm for database system identifier.Tom Lane2014-04-26
| | | | | | | | | | | | | As noted some time ago, the original coding had a typo ("|" for "^") that made the result less unique than intended. Even the intended behavior is obsolete since it was based on wanting to produce a usable value even if we didn't have int64 arithmetic --- a limitation we stopped supporting years ago. Instead, let's redefine the system identifier as tv_sec in the upper 32 bits (same as before), tv_usec in the next 20 bits, and the low 12 bits of getpid() in the remaining bits. This is still hardly guaranteed-universally-unique, but it's noticeably better than before. Per my proposal at <29019.1374535940@sss.pgh.pa.us>
* report stat() error in trigger file checkBruce Momjian2014-04-17
| | | | | | | Permissions might prevent the existence of the trigger file from being checked. Per report from Andres Freund
* Use correctly-sized buffer when zero-filling a WAL file.Heikki Linnakangas2014-04-16
| | | | | | I mixed up BLCKSZ and XLOG_BLCKSZ when I changed the way the buffer is allocated a couple of weeks ago. With the default settings, they are both 8k, but they can be changed at compile-time.
* Fix typo in comment.Heikki Linnakangas2014-04-10
| | | | Tomonari Katsumata
* Fix some compiler warnings that clang emits with -pedantic.Robert Haas2014-04-04
| | | | Andres Freund
* In checkpoint, move the check for in-progress xacts out of critical section.Heikki Linnakangas2014-04-04
| | | | | | GetVirtualXIDsDelayingChkpt calls palloc, which isn't safe in a critical section. I thought I covered this case with the exemption for the checkpointer, but CreateCheckPoint is also called from the startup process.
* Avoid allocations in critical sections.Heikki Linnakangas2014-04-04
| | | | If a palloc in a critical section fails, it becomes a PANIC.
* Pass more than the first XLogRecData entry to rm_desc, with WAL_DEBUG.Heikki Linnakangas2014-03-26
| | | | | | | | | | | | | | | If you compile with WAL_DEBUG and enable it with wal_debug=on, we used to only pass the first XLogRecData entry to the rm_desc routine. I think the original assumprion was that the first XLogRecData entry contains all the necessary information for the rm_desc routine, but that's a pretty shaky assumption. At least standby_redo didn't get the memo. To fix, piece together all the data in a temporary buffer, and pass that to the rm_desc routine. It's been like this forever, but the patch didn't apply cleanly to back-branches. Probably wouldn't be hard to fix the conflicts, but it's not worth the trouble.
* Don't forget to flush XLOG_PARAMETER_CHANGE record.Fujii Masao2014-03-26
| | | | Backpatch to 9.0 where XLOG_PARAMETER_CHANGE record was instroduced.
* Fix "the the" typos.Heikki Linnakangas2014-03-24
| | | | Erik Rijkers
* Replace the XLogInsert slots with regular LWLocks.Heikki Linnakangas2014-03-21
| | | | | | | | | | The special feature the XLogInsert slots had over regular LWLocks is the insertingAt value that was updated atomically with releasing backends waiting on it. Add new functions to the LWLock API to do that, and replace the slots with LWLocks. This reduces the amount of duplicated code. (There's still some duplication, but at least it's all in lwlock.c now.) Reviewed by Andres Freund.
* Remove rm_safe_restartpoint machinery.Heikki Linnakangas2014-03-18
| | | | | | | | | It is no longer used, none of the resource managers have multi-record actions that would make it unsafe to perform a restartpoint. Also don't allow rm_cleanup to write WAL records, it's also no longer required. Move the call to rm_cleanup routines to make it more symmetric with rm_startup.
* C comments: remove odd blank lines after #ifdef WIN32 linesBruce Momjian2014-03-13
|
* Only WAL-log the modified portion in an UPDATE, if possible.Heikki Linnakangas2014-03-12
| | | | | | | | | When a row is updated, and the new tuple version is put on the same page as the old one, only WAL-log the part of the new tuple that's not identical to the old. This saves significantly on the amount of WAL that needs to be written, in the common case that most fields are not modified. Amit Kapila, with a lot of back and forth with me, Robert Haas, and others.
* Do wal_level and hot standby checks when doing crash-then-archive recovery.Heikki Linnakangas2014-03-05
| | | | | | | | CheckRequiredParameterValues() should perform the checks if archive recovery was requested, even if we are going to perform crash recovery first. Reported by Kyotaro HORIGUCHI. Backpatch to 9.2, like the crash-then-archive recovery mode.
* Fix lastReplayedEndRecPtr calculation when starting from shutdown checkpoint.Heikki Linnakangas2014-03-05
| | | | | | | | | | | | | | | When entering crash recovery followed by archive recovery, and the latest checkpoint is a shutdown checkpoint, and there are no more WAL records to replay before transitioning from crash to archive recovery, we would not immediately allow read-only connections in hot standby mode even if we could. That's because when starting from a shutdown checkpoint, we set lastReplayedEndRecPtr incorrectly to the record before the checkpoint record, instead of the checkpoint record itself. We don't run the redo routine of the shutdown checkpoint record, but starting recovery from it goes through the same motions, so it should be considered as replayed. Reported by Kyotaro HORIGUCHI. All versions with hot standby are affected, so backpatch to 9.0.
* Introduce logical decoding.Robert Haas2014-03-03
| | | | | | | | | | | | | | | | | | | | | | This feature, building on previous commits, allows the write-ahead log stream to be decoded into a series of logical changes; that is, inserts, updates, and deletes and the transactions which contain them. It is capable of handling decoding even across changes to the schema of the effected tables. The output format is controlled by a so-called "output plugin"; an example is included. To make use of this in a real replication system, the output plugin will need to be modified to produce output in the format appropriate to that system, and to perform filtering. Currently, information can be extracted from the logical decoding system only via SQL; future commits will add the ability to stream changes via walsender. Andres Freund, with review and other contributions from many other people, including Álvaro Herrera, Abhijit Menon-Sen, Peter Gheogegan, Kevin Grittner, Robert Haas, Heikki Linnakangas, Fujii Masao, Abhijit Menon-Sen, Michael Paquier, Simon Riggs, Craig Ringer, and Steve Singer.
* Remove bogus while-loop.Heikki Linnakangas2014-02-28
| | | | | | | | | | Commit abf5c5c9a4f142b3343614746bb9e99a794f8e7b added a bogus while- statement after the for(;;)-loop. It went unnoticed in testing, because it was dead code. Report by KONDO Mitsumasa. Backpatch to 9.3. The commit that introduced this was also applied to 9.2, but not the bogus while-loop part, because the code in 9.2 looks quite different.
* Improve comment on setting data_checksum GUC.Heikki Linnakangas2014-02-20
| | | | There was an extra space there, and "fixed" wasn't very descriptive.
* Fix comment; checkpointer, not bgwriter, performs checkpoints since 9.2.Heikki Linnakangas2014-02-18
| | | | Amit Langote
* Prevent potential overruns of fixed-size buffers.Tom Lane2014-02-17
| | | | | | | | | | | | | | | | | | | | | | | Coverity identified a number of places in which it couldn't prove that a string being copied into a fixed-size buffer would fit. We believe that most, perhaps all of these are in fact safe, or are copying data that is coming from a trusted source so that any overrun is not really a security issue. Nonetheless it seems prudent to forestall any risk by using strlcpy() and similar functions. Fixes by Peter Eisentraut and Jozef Mlich based on Coverity reports. In addition, fix a potential null-pointer-dereference crash in contrib/chkpass. The crypt(3) function is defined to return NULL on failure, but chkpass.c didn't check for that before using the result. The main practical case in which this could be an issue is if libc is configured to refuse to execute unapproved hashing algorithms (e.g., "FIPS mode"). This ideally should've been a separate commit, but since it touches code adjacent to one of the buffer overrun changes, I included it in this commit to avoid last-minute merge issues. This issue was reported by Honza Horak. Security: CVE-2014-0065 for buffer overruns, CVE-2014-0066 for crypt()
* Change the order that pg_xlog and WAL archive are polled for WAL segments.Heikki Linnakangas2014-02-14
| | | | | | | | | | | | | | | | | | | If there is a WAL segment with same ID but different TLI present in both the WAL archive and pg_xlog, prefer the one with higher TLI. Before this patch, the archive was polled first, for all expected TLIs, and only if no file was found was pg_xlog scanned. This was a change in behavior from 9.3, which first scanned archive and pg_xlog for the highest TLI, then archive and pg_xlog for the next highest TLI and so forth. This patch reverts the behavior back to what it was in 9.2. The reason for this is that if for example you try to do archive recovery to timeline 2, which branched off timeline 1, but the WAL for timeline 2 is not archived yet, we would replay past the timeline switch point on timeline 1 using the archived files, before even looking timeline 2's files in pg_xlog Report and patch by Kyotaro Horiguchi. Backpatch to 9.3 where the behavior was changed.
* Fix WakeupWaiters() to not wake up an exclusive locker unnecessarily.Heikki Linnakangas2014-02-10
| | | | | | | | WakeupWaiters() is supposed to wake up all LW_WAIT_UNTIL_FREE waiters of the slot, but the loop incorrectly also woke up the first LW_EXCLUSIVE waiter, if there was no LW_WAIT_UNTIL_FREE waiters in the queue. Noted by Andres Freund. This code is new in 9.4, so no backpatching.
* Introduce replication slots.Robert Haas2014-01-31
| | | | | | | | | | | | | | | | Replication slots are a crash-safe data structure which can be created on either a master or a standby to prevent premature removal of write-ahead log segments needed by a standby, as well as (with hot_standby_feedback=on) pruning of tuples whose removal would cause replication conflicts. Slots have some advantages over existing techniques, as explained in the documentation. In a few places, we refer to the type of replication slots introduced by this patch as "physical" slots, because forthcoming patches for logical decoding will also have slots, but with somewhat different properties. Andres Freund and Robert Haas
* Add recovery_target='immediate' option.Heikki Linnakangas2014-01-25
| | | | | | | | This allows ending recovery as a consistent state has been reached. Without this, there was no easy way to e.g restore an online backup, without replaying any extra WAL after the backup ended. MauMau and me.
* Allow use of "z" flag in our printf calls, and use it where appropriate.Tom Lane2014-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | Since C99, it's been standard for printf and friends to accept a "z" size modifier, meaning "whatever size size_t has". Up to now we've generally dealt with printing size_t values by explicitly casting them to unsigned long and using the "l" modifier; but this is really the wrong thing on platforms where pointers are wider than longs (such as Win64). So let's start using "z" instead. To ensure we can do that on all platforms, teach src/port/snprintf.c to understand "z", and add a configure test to force use of that implementation when the platform's version doesn't handle "z". Having done that, modify a bunch of places that were using the unsigned-long hack to use "z" instead. This patch doesn't pretend to have gotten everyplace that could benefit, but it catches many of them. I made an effort in particular to ensure that all uses of the same error message text were updated together, so as not to increase the number of translatable strings. It's possible that this change will result in format-string warnings from pre-C99 compilers. We might have to reconsider if there are any popular compilers that will warn about this; but let's start by seeing what the buildfarm thinks. Andres Freund, with a little additional work by me
* Fix multiple bugs in index page locking during hot-standby WAL replay.Tom Lane2014-01-14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In ordinary operation, VACUUM must be careful to take a cleanup lock on each leaf page of a btree index; this ensures that no indexscans could still be "in flight" to heap tuples due to be deleted. (Because of possible index-tuple motion due to concurrent page splits, it's not enough to lock only the pages we're deleting index tuples from.) In Hot Standby, the WAL replay process must likewise lock every leaf page. There were several bugs in the code for that: * The replay scan might come across unused, all-zero pages in the index. While btree_xlog_vacuum itself did the right thing (ie, nothing) with such pages, xlogutils.c supposed that such pages must be corrupt and would throw an error. This accounts for various reports of replication failures with "PANIC: WAL contains references to invalid pages". To fix, add a ReadBufferMode value that instructs XLogReadBufferExtended not to complain when we're doing this. * btree_xlog_vacuum performed the extra locking if standbyState == STANDBY_SNAPSHOT_READY, but that's not the correct test: we won't open up for hot standby queries until the database has reached consistency, and we don't want to do the extra locking till then either, for fear of reading corrupted pages (which bufmgr.c would complain about). Fix by exporting a new function from xlog.c that will report whether we're actually in hot standby replay mode. * To ensure full coverage of the index in the replay scan, btvacuumscan would emit a dummy WAL record for the last page of the index, if no vacuuming work had been done on that page. However, if the last page of the index is all-zero, that would result in corruption of said page, since the functions called on it weren't prepared to handle that case. There's no need to lock any such pages, so change the logic to target the last normal leaf page instead. The first two of these bugs were diagnosed by Andres Freund, the other one by me. Fixes based on ideas from Heikki Linnakangas and myself. This has been wrong since Hot Standby was introduced, so back-patch to 9.0.
* Refactor checking whether we've reached the recovery target.Heikki Linnakangas2014-01-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Makes the replay loop slightly more readable, by separating the concerns of whether to stop and whether to delay, and how to extract the timestamp from a record. This has the user-visible change that the timestamp of the last applied record is now updated after actually applying it. Before, it was updated just before applying it. That meant that pg_last_xact_replay_timestamp() could return the timestamp of a commit record that is in process of being replayed, but not yet applied. Normally the difference is small, but if min_recovery_apply_delay is set, there could be a significant delay between reading a record and applying it. Another behavioral change is that if you recover to a restore point, we stop after the restore point record, not before it. It makes no difference as far as running queries on the server is concerned, as applying a restore point record changes nothing, but if examine the timeline history you will see that the new timeline branched off just after the restore point record, not before it. One practical consequence is that if you do PITR to the new timeline, and set recovery target to the same named restore point again, it will find and stop recovery at the same restore point. Conceptually, I think it makes more sense to consider the restore point as part of the new timeline's history than not. In principle, setting the last-replayed timestamp before actually applying the record was a bug all along, but it doesn't seem worth the risk to backpatch, since min_recovery_apply_delay was only added in 9.4.
* Fix pause_at_recovery_target + recovery_target_inclusive combination.Heikki Linnakangas2014-01-08
| | | | | | | | | | | | If pause_at_recovery_target is set, recovery pauses *before* applying the target record, even if recovery_target_inclusive is set. If you then continue with pg_xlog_replay_resume(), it will apply the target record before ending recovery. In other words, if you log in while it's paused and verify that the database looks OK, ending recovery changes its state again, possibly destroying data that you were tring to salvage with PITR. Backpatch to 9.1, this has been broken since pause_at_recovery_target was added.
* If multiple recovery_targets are specified, use the latest one.Heikki Linnakangas2014-01-08
| | | | | | | | | | | | | The docs say that only one of recovery_target_xid, recovery_target_time, or recovery_target_name can be specified. But the code actually did something different, so that a name overrode time, and xid overrode both time and name. Now the target specified last takes effect, whether it's an xid, time or name. With this patch, we still accept multiple recovery_target settings, even though docs say that only one can be specified. It's a general property of the recovery.conf file parser that you if you specify the same option twice, the last one takes effect, like with postgresql.conf.
* Fix bug in determining when recovery has reached consistency.Heikki Linnakangas2014-01-08
| | | | | | | | | | | | | | | | | | | | | | | | | | When starting WAL replay from an online checkpoint, the last replayed WAL record variable was initialized using the checkpoint record's location, even though the records between the REDO location and the checkpoint record had not been replayed yet. That was noted as "slightly confusing" but harmless in the comment, but in some cases, it fooled CheckRecoveryConsistency to incorrectly conclude that we had already reached a consistent state immediately at the beginning of WAL replay. That caused the system to accept read-only connections in hot standby mode too early, and also PANICs with message "WAL contains references to invalid pages". Fix by initializing the variables to the REDO location instead. In 9.2 and above, change CheckRecoveryConsistency() to use lastReplayedEndRecPtr variable when checking if backup end location has been reached. It was inconsistently using EndRecPtr for that check, but lastReplayedEndRecPtr when checking min recovery point. It made no difference before this patch, because in all the places where CheckRecoveryConsistency was called the two variables were the same, but it was always an accident waiting to happen, and would have been wrong after this patch anyway. Report and analysis by Tomonari Katsumata, bug #8686. Backpatch to 9.0, where hot standby was introduced.
* Update copyright for 2014Bruce Momjian2014-01-07
| | | | | Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.
* Move permissions check from do_pg_start_backup to pg_start_backupMagnus Hagander2014-01-07
| | | | | | | | | | And the same for do_pg_stop_backup. The code in do_pg_* is not allowed to access the catalogs. For manual base backups, the permissions check can be handled in the calling function, and for streaming base backups only users with the required permissions can get past the authentication step in the first place. Reported by Antonin Houska, diagnosed by Andres Freund
* Rename walLogHints to wal_log_hints for easier grepping.Robert Haas2014-01-01
| | | | Michael Paquier
* Rename wal_log_hintbits to wal_log_hints, per discussion on pgsql-hackers.Fujii Masao2013-12-21
| | | | Sawada Masahiko
* Fix more instances of "the the" in comments.Heikki Linnakangas2013-12-13
| | | | Plus one instance of "to to" in the docs.