aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/transam/xlog.c
Commit message (Collapse)AuthorAge
...
* Add a GUC to report whether data page checksums are enabled.Heikki Linnakangas2013-09-16
| | | | Bernd Helmle
* Revert WAL posix_fallocate() patches.Jeff Davis2013-09-04
| | | | | | | | | | This reverts commit 269e780822abb2e44189afaccd6b0ee7aefa7ddd and commit 5b571bb8c8d2bea610e01ae1ee7bc05adcfff528. Unfortunately, the initial patch had insufficient performance testing, and resulted in a regression. Per report by Thom Brown.
* Keep heavily-contended fields in XLogCtlInsert on different cache lines.Heikki Linnakangas2013-09-04
| | | | | | | Performance testing shows that if the insertpos_lck spinlock and the fields that it protects are on the same cache line with other variables that are frequently accessed, the false sharing can hurt performance a lot. Keep them apart by adding some padding.
* Rename the "fast_promote" file to just "promote".Heikki Linnakangas2013-08-19
| | | | | | | | | | This keeps the usual trigger file name unchanged from 9.2, avoiding nasty issues if you use a pre-9.3 pg_ctl binary with a 9.3 server or vice versa. The fallback behavior of creating a full checkpoint before starting up is now triggered by a file called "fallback_promote". That can be useful for debugging purposes, but we don't expect any users to have to resort to that and we might want to remove that in the future, which is why the fallback mechanism is undocumented.
* Message punctuation and pluralization fixesPeter Eisentraut2013-08-09
|
* Message style improvementsPeter Eisentraut2013-07-28
|
* Fix variable names mentioned in comment to match the code.Heikki Linnakangas2013-07-17
| | | | | | | Also, in another comment, explain why holding an insertion slot is a critical section. Per review by Amit Kapila.
* Fix assert failure at end of recovery, broken by XLogInsert scaling patch.Heikki Linnakangas2013-07-17
| | | | | | | | | | | | | | | | | | Initialization of the first XLOG buffer at end-of-recovery was broken for the case that the last read WAL record ended at a page boundary. Instead of trying to copy the last full xlog page to the buffer cache in that case, just set shared state so that the next page is initialized when the first WAL record after startup is inserted. (that's what we did in earlier version, too) To make the shared state required for that case less surprising, replace the XLogCtl->curridx variable, which was the index of the latest initialized buffer, with an XLogRecPtr of how far the buffers have been initialized. That also allows us to get rid of the XLogRecEndPtrToBufIdx macro. While we're at it, make a similar change for XLogCtl->Write.curridx, getting rid of that variable and calculating the next buffer to write from XLogCtl->LogwrtResult instead.
* Fix Windows build.Heikki Linnakangas2013-07-08
| | | | | | Was broken by my xloginsert scaling patch. XLogCtl global variable needs to be initialized in each process, as it's not inherited by fork() on Windows.
* Improve scalability of WAL insertions.Heikki Linnakangas2013-07-08
| | | | | | | | | | | | | | | | | | | | | | This patch replaces WALInsertLock with a number of WAL insertion slots, allowing multiple backends to insert WAL records to the WAL buffers concurrently. This is particularly useful for parallel loading large amounts of data on a system with many CPUs. This has one user-visible change: switching to a new WAL segment with pg_switch_xlog() now fills the remaining unused portion of the segment with zeros. This potentially adds some overhead, but it has been a very common practice by DBA's to clear the "tail" of the segment with an external pg_clearxlogtail utility anyway, to make the WAL files compress better. With this patch, it's no longer necessary to do that. This patch adds a new GUC, xloginsert_slots, to tune the number of WAL insertion slots. Performance testing suggests that the default, 8, works pretty well for all kinds of worklods, but I left the GUC in place to allow others with different hardware to test that easily. We might want to remove that before release. Reviewed by Andres Freund.
* Handle posix_fallocate() errors.Jeff Davis2013-07-06
| | | | | | | | | On some platforms, posix_fallocate() is available but may still return EINVAL if the underlying filesystem does not support it. So, in case of an error, fall through to the alternate implementation that just writes zeros. Per buildfarm failure and analysis by Tom Lane.
* Use posix_fallocate() for new WAL files, where available.Jeff Davis2013-07-05
| | | | | | | | This function is more efficient than actually writing out zeroes to the new file, per microbenchmarks by Jon Nelson. Also, it may reduce the likelihood of WAL file fragmentation. Jon Nelson, with review by Andres Freund, Greg Smith and me.
* Add new GUC, max_worker_processes, limiting number of bgworkers.Robert Haas2013-07-04
| | | | | | | | | | | | | | | | | | | | | | | | In 9.3, there's no particular limit on the number of bgworkers; instead, we just count up the number that are actually registered, and use that to set MaxBackends. However, that approach causes problems for Hot Standby, which needs both MaxBackends and the size of the lock table to be the same on the standby as on the master, yet it may not be desirable to run the same bgworkers in both places. 9.3 handles that by failing to notice the problem, which will probably work fine in nearly all cases anyway, but is not theoretically sound. A further problem with simply counting the number of registered workers is that new workers can't be registered without a postmaster restart. This is inconvenient for administrators, since bouncing the postmaster causes an interruption of service. Moreover, there are a number of applications for background processes where, by necessity, the background process must be started on the fly (e.g. parallel query). While this patch doesn't actually make it possible to register new background workers after startup time, it's a necessary prerequisite. Patch by me. Review by Michael Paquier.
* Retry short writes when flushing WAL.Heikki Linnakangas2013-07-01
| | | | | | | | | | | | | We don't normally bother retrying when the number of bytes written by write() is short of what was requested. It is generally assumed that a write() to disk doesn't return short, unless you run out of disk space. While writing the WAL, however, it seems prudent to try a bit harder, because a failure leads to PANIC. The write() is also much larger than most write()s in the backend (up to wal_buffers), so there's more room for surprises. Also retry on EINTR. All signals used in the backend are flagged SA_RESTART nowadays, so it shouldn't happen, but better to be defensive.
* Ensure no xid gaps during Hot Standby startupSimon Riggs2013-06-23
| | | | | | | | | In some cases with higher numbers of subtransactions it was possible for us to incorrectly initialize subtrans leading to complaints of missing pages. Bug report by Sergey Konoplev Analysis and fix by Andres Freund
* Add buffer_std flag to MarkBufferDirtyHint().Jeff Davis2013-06-17
| | | | | | | | | | MarkBufferDirtyHint() writes WAL, and should know if it's got a standard buffer or not. Currently, the only callers where buffer_std is false are related to the FSM. In passing, rename XLOG_HINT to XLOG_FPI, which is more descriptive. Back-patch to 9.3.
* Remove special-case treatment of LOG severity level in standalone mode.Tom Lane2013-06-13
| | | | | | | | | | | | | elog.c has historically treated LOG messages as low-priority during bootstrap and standalone operation. This has led to confusion and even masked a bug, because the normal expectation of code authors is that elog(LOG) will put something into the postmaster log, and that wasn't happening during initdb. So get rid of the special-case rule and make the priority order the same as it is in normal operation. To keep from cluttering initdb's output and the behavior of a standalone backend, tweak the severity level of three messages routinely issued by xlog.c during startup and shutdown so that they won't appear in these cases. Per my proposal back in December.
* Observe array length in HaveVirtualXIDsDelayingChkpt().Noah Misch2013-06-12
| | | | | | | | Since commit f21bb9cfb5646e1793dcc9c0ea697bab99afa523, this function ignores the caller-provided length and loops until it finds a terminator, which GetVirtualXIDsDelayingChkpt() never adds. Restore the previous loop control logic. In passing, revert the addition of an unused variable by the same commit, presumably a debugging relic.
* Fix typo in comment.Heikki Linnakangas2013-06-06
|
* Code review of recycling WAL segments in a restartpoint.Heikki Linnakangas2013-06-03
| | | | | | | | Seems cleaner to get the currently-replayed TLI in the same call to GetXLogReplayRecPtr that we get the WAL position. Make it more clear in the comment what the code does when recovery has already ended (RecoveryInProgress() will set ThisTimeLineID in that case). Finally, make resetting ThisTimeLineID afterwards more explicit.
* Post-pgindent cleanupStephen Frost2013-06-01
| | | | | | | | | | Make slightly better decisions about indentation than what pgindent is capable of. Mostly breaking out long function calls into one line per argument, with a few other minor adjustments. No functional changes- all whitespace. pgindent ran cleanly (didn't change anything) after. Passes all regressions.
* pgindent run for release 9.3Bruce Momjian2013-05-29
| | | | | This is the first run of the Perl-based pgindent script. Also update pgindent instructions.
* After fast promotion use CHECKPOINT_FORCESimon Riggs2013-05-21
| | | | | | | Not necessary for correctness, just to make log_checkpoints output look less singular. Requested by Fujii Masao
* Maintain ThisTimeLineID correctly in checkpointerSimon Riggs2013-05-21
| | | | | | | | | | | | checkpointer needs to reset ThisTimeLineID after a restartpoint to allow installing/recycling new WAL files. If recovery has already ended this would leave ThisTimeLineID set incorrectly and so we must reset it otherwise later checkpoints do not have the correct timeline. Bug report by Heikki Linnakangas. Further investigation by Heikki and myself.
* Init crash recovery using the latest available TLISimon Riggs2013-05-19
| | | | | | | | This simplifies the handling of crashes after fast promotion and various minor cases that can exist in short timing windows around that case. Broad fix to bug reported by Michael Paquier on -hackers, approach prompted by Heikki Linnakangas
* Emit msg correctly for timeline-crossing crashSimon Riggs2013-05-19
|
* Remove single space on end of a line in xlog.cSimon Riggs2013-05-19
| | | | Michael Paquier
* Fix walsender failure at promotion.Heikki Linnakangas2013-05-08
| | | | | | | | | | | | | | | | | | | | | | | | If a standby server has a cascading standby server connected to it, it's possible that WAL has already been sent up to the next WAL page boundary, splitting a WAL record in the middle, when the first standby server is promoted. Don't throw an assertion failure or error in walsender if that happens. Also, fix a variant of the same bug in pg_receivexlog: if it had already received WAL on previous timeline up to a segment boundary, when the upstream standby server is promoted so that the timeline switch record falls on the previous segment, pg_receivexlog would miss the segment containing the timeline switch. To fix that, have walsender send the position of the timeline switch at end-of-streaming, in addition to the next timeline's ID. It was previously assumed that the switch happened exactly where the streaming stopped. Note: this is an incompatible change in the streaming protocol. You might get an error if you try to stream over timeline switches, if the client is running 9.3beta1 and the server is more recent. It should be fine after a reconnect, however. Reported by Fujii Masao.
* Record data_checksum_version in control file.Simon Riggs2013-04-30
| | | | | | The value is not used anywhere in code, but will allow future changes to the checksum version should that become necessary in the future.
* Make fast promotion the default promotion mode.Simon Riggs2013-04-24
| | | | | Continue to allow a request for synchronous checkpoints as a mechanism in case of problems.
* Fix calculation of how many segments to retain for wal_keep_segments.Heikki Linnakangas2013-04-08
| | | | | | | KeepLogSeg function was broken when we switched to use a 64-bit int for the segment number. Per report from Jeff Janes.
* Skip extraneous locking in XLogCheckBuffer().Simon Riggs2013-04-08
| | | | | | | Heikki reported comment was wrong, so fixed code to match the comment: we only need to take additional locking precautions when we have a shared lock on the buffer.
* Avoid tricky race condition recording XLOG_HINTSimon Riggs2013-04-08
| | | | | | | | | | | | | | | | | | We copy the buffer before inserting an XLOG_HINT to avoid WAL CRC errors caused by concurrent hint writes to buffer while share locked. To make this work we refactor RestoreBackupBlock() to allow an XLOG_HINT to avoid the normal path for backup blocks, which assumes the underlying buffer is exclusive locked. Resulting code completely changes layout of XLOG_HINT WAL records, but this isn't even beta code, so this is a low impact change. In passing, avoid taking WALInsertLock for full page writes on checksummed hints, remove related cruft from XLogInsert() and improve xlog_desc record for XLOG_HINT. Andres Freund Bug report by Fujii Masao, testing by Jeff Janes and Jaime Casanova, review by Jeff Davis and Simon Riggs. Applied with changes from review and some comment editing.
* Make REPLICATION privilege checks test current user not authenticated user.Tom Lane2013-04-01
| | | | | | | | | | | The pg_start_backup() and pg_stop_backup() functions checked the privileges of the initially-authenticated user rather than the current user, which is wrong. For example, a user-defined index function could successfully call these functions when executed by ANALYZE within autovacuum. This could allow an attacker with valid but low-privilege database access to interfere with creation of routine backups. Reported and fixed by Noah Misch. Security: CVE-2013-1901
* Revoke bc5334d8679c428a709d150666b288171795bd76Simon Riggs2013-03-28
|
* Allow external recovery_config_directorySimon Riggs2013-03-27
| | | | | If required, recovery.conf can now be located outside of the data directory. Server needs read/write permissions on this directory.
* Allow I/O reliability checks using 16-bit checksumsSimon Riggs2013-03-22
| | | | | | | | | | | | | | | | | | | Checksums are set immediately prior to flush out of shared buffers and checked when pages are read in again. Hint bit setting will require full page write when block is dirtied, which causes various infrastructure changes. Extensive comments, docs and README. WARNING message thrown if checksum fails on non-all zeroes page; ERROR thrown but can be disabled with ignore_checksum_failure = on. Feature enabled by an initdb option, since transition from option off to option on is long and complex and has not yet been implemented. Default is not to use checksums. Checksum used is WAL CRC-32 truncated to 16-bits. Simon Riggs, Jeff Davis, Greg Smith Wide input and assistance from many community members. Thank you.
* Remove PageSetTLI and rename pd_tli to pd_checksumSimon Riggs2013-03-18
| | | | | | | | | | | | | | Remove use of PageSetTLI() from all page manipulation functions and adjust README to indicate change in the way we make changes to pages. Repurpose those bytes into the pd_checksum field and explain how that works in comments about page header. Refactoring ahead of actual feature patch which would make use of the checksum field, arriving later. Jeff Davis, with comments and doc changes by Simon Riggs Direction suggested by Robert Haas; many others providing review comments.
* Move pqsignal() to libpgport.Tom Lane2013-03-17
| | | | | | | | | We had two copies of this function in the backend and libpq, which was already pretty bogus, but it turns out that we need it in some other programs that don't use libpq (such as pg_test_fsync). So put it where it probably should have been all along. The signal-mask-initialization support in src/backend/libpq/pqsignal.c stays where it is, though, since we only need that in the backend.
* Fix tli history file fetching, broken by the archive after crash recevery patch.Heikki Linnakangas2013-03-07
| | | | | | | | | | | | | | If we were about to enter archive recovery after crash recovery, we scanned the archive for the latest tli history file, and set the recovery target timeline to that. However, when we actually tried to read the history file, we would not fetch the file from the archive, because we were not in archive recovery yet. To fix, make readTimeLineHistory and existsTimeLineHistory to always fetch the file from archive if archive recovery is requested, even if we're not in archive recovery yet. Backpatch to 9.2. Mitsumasa KONDO
* Fix thinko in previous commit.Heikki Linnakangas2013-02-22
| | | | | | We must still initialize minRecoveryPoint if we start straight with archive recovery, e.g when recovering from a normal base backup taken with pg_start/stop_backup. Otherwise we never consider the system consistent.
* If recovery.conf is created after "pg_ctl stop -m i", do crash recovery.Heikki Linnakangas2013-02-22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If you create a base backup using an atomic filesystem snapshot, and try to perform PITR starting from that base backup, or if you just kill a master server and create recovery.conf to put it into standby mode, we don't know how far we need to recover before reaching consistency. Normally in crash recovery, we replay all the WAL present in pg_xlog, and assume that we're consistent after that. And normally in archive recovery, minRecoveryPoint, backupEndRequired, or backupEndPoint is set in the control file, indicating how far we need to replay to reach consistency. But if the server was previously up and running normally, and you kill -9 it or take an atomic filesystem snapshot, none of those fields are set in the control file. The solution is to perform crash recovery first, replaying all the WAL in pg_xlog. After that's done, we assume that the system is consistent like in normal crash recovery, and switch to archive recovery mode after that. Per report from Kyotaro HORIGUCHI. In his scenario, recovery.conf was created after "pg_ctl stop -m i". I'm not sure we need to support that exact scenario, but we should support backing up using a filesystem snapshot, which looks identical. This issue goes back to at least 9.0, where hot standby was introduced and we started to track when consistency is reached. In 9.1 and 9.2, we would open up for hot standby too early, and queries could briefly see an inconsistent state. But 9.2 made it more visible, as we started to PANIC if we see a reference to a non-existing page during recovery, if we've already reached consistency. This is a fairly big patch, so back-patch to 9.2 only, where the issue is more visible. We can consider back-patching further after this has received some more testing in 9.2 and master.
* Better fix for "unarchived WAL files get deleted on crash recovery" bug.Heikki Linnakangas2013-02-15
| | | | | | | | | | Revert my earlier fix for the bug that unarchived WAL files get deleted on crash recovery, commit c9cc7e05c6d82a9781883a016c70d95aa4923122. We create a .done file for files streamed or restored from archive, so the WAL file recycling logic used during normal operation works just as well during archive recovery. Per Fujii Masao's suggestion.
* Don't delete unarchived WAL files during crash recovery.Heikki Linnakangas2013-02-15
| | | | | Bug reported by Jehan-Guillaume (ioguix) de Rorthais. This was introduced with the change to keep WAL files restored from archive in pg_xlog, in 9.2.
* Support unlogged GiST index.Heikki Linnakangas2013-02-11
| | | | | | | | | | | | | | | The reason this wasn't supported before was that GiST indexes need an increasing sequence to detect concurrent page-splits. In a regular WAL- logged GiST index, the LSN of the page-split record is used for that purpose, and in a temporary index, we can get away with a backend-local counter. Neither of those methods works for an unlogged relation. To provide such an increasing sequence of numbers, create a "fake LSN" counter that is saved and restored across shutdowns. On recovery, unlogged relations are blown away, so the counter doesn't need to survive that either. Jeevan Chalke, based on discussions with Robert Haas, Tom Lane and me.
* Fix checkpoint after fast promotion.Heikki Linnakangas2013-02-11
| | | | | | | | | | The intention was to request a regular online checkpoint immediately after end of recovery, when performing "fast promotion". However, because the checkpoint was requested before other backends were allowed to write WAL, the checkpointer process performed a restartpoint rather than a checkpoint. Delay the RequestCheckPoint call until after recovery has truly ended, so that you get a real checkpoint.
* Include previous TLI in end-of-recovery and shutdown checkpoint records.Heikki Linnakangas2013-02-11
| | | | | | This isn't used for anything but a sanity check at the moment, but it could be highly valuable for debugging purposes. It could also be used to recreate timeline history by traversing WAL, which seems useful.
* Rely only on checkpoint 1 at end of recovery.Simon Riggs2013-02-07
| | | | | | | Searching for checkpoint 2 (previous) is not correct in all cases. Bug report from Heikki Linnakangas
* Switch timelines if we crash soon after promotion.Simon Riggs2013-01-31
| | | | | | | | | Previous patch to skip checkpoints at end of recovery didn't correctly perform crash recovery, fumbling the timeline switch. Now we record the minRecoveryPointTLI of the newly selected timeline, so that we crash recover to the correct timeline. Bug report from Fujii Masao, investigated by me.
* Fast promote mode skips checkpoint at end of recovery.Simon Riggs2013-01-29
| | | | | | | | | | | pg_ctl promote -m fast will skip the checkpoint at end of recovery so that we can achieve very fast failover when the apply delay is low. Write new WAL record XLOG_END_OF_RECOVERY to allow us to switch timeline correctly for downstream log readers. If we skip synchronous end of recovery checkpoint we request a normal spread checkpoint so that the window of re-recovery is low. Simon Riggs and Kyotaro Horiguchi, with input from Fujii Masao. Review by Heikki Linnakangas