aboutsummaryrefslogtreecommitdiff
path: root/src/backend/storage
Commit message (Collapse)AuthorAge
* shm_mq: After a send fails with SHM_MQ_DETACHED, later ones should too.Robert Haas2016-06-06
| | | | | | | | | | | | | | | | Prior to this patch, it was occasionally possible, after shm_mq_sendv had previously returned SHM_MQ_DETACHED, for a later shm_mq_sendv operation to fail an assertion instead of just again returning SHM_MQ_ATTACHED. From the shm_mq code's point of view, it was expecting to be called again with the same arguments, since the previous operation had only partially completed. However, a caller who isn't using non-blocking mode won't be prepared to repeat the call with the same arguments, and this code shouldn't expect that they will. Repair in such a way that we'll be OK whether the next call uses the same arguments or not. Found by Andreas Seltenreich. Analysis and sketch of fix by Amit Kapila. Patch by me, reviewed by Amit Kapila.
* Fix whitespacePeter Eisentraut2016-06-05
|
* Fix various common mispellings.Greg Stark2016-06-03
| | | | | | | | | | Mostly these are just comments but there are a few in documentation and a handful in code and tests. Hopefully this doesn't cause too much unnecessary pain for backpatching. I relented from some of the most common like "thru" for that reason. The rest don't seem numerous enough to cause problems. Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
* Be conservative about alignment requirements of struct epoll_event.Greg Stark2016-06-02
| | | | | | | | | | | | Use MAXALIGN size/alignment to guarantee that later uses of memory are aligned correctly. E.g. epoll_event might need 8 byte alignment on some platforms, but earlier allocations like WaitEventSet and WaitEvent might not sized to guarantee that when purely using sizeof(). Found by myself while testing on an Sun Ultra 5 (Sparc IIi) with some editorializing by Andres Freund. In passing fix a couple typos in the area
* Fix PageAddItem BRIN bugAlvaro Herrera2016-05-30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BRIN was relying on the ability to remove a tuple from an index page, then putting another tuple in the same line pointer. But PageAddItem refuses to add a tuple beyond the first free item past the last used item, and in particular, it rejects an attempt to add an item to an empty page anywhere other than the first line pointer. PageAddItem issues a WARNING and indicates to the caller that it failed, which in turn causes the BRIN calling code to issue a PANIC, so the whole sequence looks like this: WARNING: specified item offset is too large PANIC: failed to add BRIN tuple To fix, create a new function PageAddItemExtended which is like PageAddItem except that the two boolean arguments become a flags bitmap; the "overwrite" and "is_heap" boolean flags in PageAddItem become PAI_OVERWITE and PAI_IS_HEAP flags in the new function, and a new flag PAI_ALLOW_FAR_OFFSET enables the behavior required by BRIN. PageAddItem() retains its original signature, for compatibility with third-party modules (other callers in core code are not modified, either). Also, in the belt-and-suspenders spirit, I added a new sanity check in brinGetTupleForHeapBlock to raise an error if an TID found in the revmap is not marked as live by the page header. This causes it to react with "ERROR: corrupted BRIN index" to the bug at hand, rather than a hard crash. Backpatch to 9.5. Bug reported by Andreas Seltenreich as detected by his handy sqlsmith fuzzer. Discussion: https://www.postgresql.org/message-id/87mvni77jh.fsf@elite.ansel.ydns.eu
* Fix range check for effective_io_concurrencyAlvaro Herrera2016-05-24
| | | | | | | | Commit 1aba62ec moved the range check of that option form guc.c into bufmgr.c, but introduced a bug by changing a >= 0.0 to > 0.0, which made the value 0 no longer accepted. Put it back. Reported by Jeff Janes, diagnosed by Tom Lane
* Fix transient mdsync() errors of truncated relations due to 72a98a6395.Andres Freund2016-05-04
| | | | | | | | | | | | | | | | | | | | | Unfortunately the segment size checks from 72a98a6395 had the negative side-effect of breaking a corner case in mdsync(): When processing a fsync request for a truncated away segment mdsync() could fail with "could not fsync file" (if previous segment < RELSEG_SIZE) because _mdfd_getseg() now wouldn't return the relevant segment anymore. The cleanest fix seems to be to allow the caller of _mdfd_getseg() to specify whether checks for RELSEG_SIZE are performed. To allow doing so, change the ExtensionBehavior enum into a bitmask. Besides allowing for the addition of EXTENSION_DONT_CHECK_SIZE, this makes for a nicer implementation of EXTENSION_REALLY_RETURN_NULL. Besides mdsync() the only callsite that should change behaviour due to this is mdprefetch() which now doesn't create segments anymore, even in recovery. Given the uses of mdprefetch() that seems better. Reported-By: Thom Brown Discussion: CAA-aLv72QazLvPdKZYpVn4a_Eh+i4_cxuB03k+iCuZM_xjc+6Q@mail.gmail.com
* Don't open formally non-existent segments in _mdfd_getseg().Andres Freund2016-04-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before this commit _mdfd_getseg(), in contrast to mdnblocks(), did not verify whether all segments leading up to the to-be-opened one, were RELSEG_SIZE sized. That is e.g. not the case after truncating a relation, because later segments just get truncated to zero length, not removed. Once a "non-existent" segment has been opened in a session, mdnblocks() will return wrong results, causing errors like "could not read block %u in file" when accessing blocks. Closing the session, or the later arrival of relevant invalidation messages, would "fix" the problem. That, so far, was mostly harmless, because most segment accesses are only done after an mdnblocks() call. But since 428b1d6b29ca we try to open segments that might have been deleted, to trigger kernel writeback from a backend's queue of recent writes. To fix check segment sizes in _mdfd_getseg() when opening previously unopened segments. In practice this shouldn't imply a lot of additional lseek() calls, because mdnblocks() will most of the time already have opened all relevant segments. This commit also fixes a second problem, namely that _mdfd_getseg( EXTENSION_RETURN_NULL) extends files during recovery, which is not desirable for the mdwriteback() case. Add EXTENSION_REALLY_RETURN_NULL, which does not behave that way, and use it. Reported-By: Thom Brown Author: Andres Freund, Abhijit Menon-Sen Reviewd-By: Robert Haas, Fabien Coehlo Discussion: CAA-aLv6Dp_ZsV-44QA-2zgkqWKQq=GedBX2dRSrWpxqovXK=Pg@mail.gmail.com Fixes: 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf
* Emit invalidations to standby for transactions without xid.Andres Freund2016-04-26
| | | | | | | | | | | | | | | | | | | | | | | | | | | So far, when a transaction with pending invalidations, but without an assigned xid, committed, we simply ignored those invalidation messages. That's problematic, because those are actually sent for a reason. Known symptoms of this include that existing sessions on a hot-standby replica sometimes fail to notice new concurrently built indexes and visibility map updates. The solution is to WAL log such invalidations in transactions without an xid. We considered to alternatively force-assign an xid, but that'd be problematic for vacuum, which might be run in systems with few xids. Important: This adds a new WAL record, but as the patch has to be back-patched, we can't bump the WAL page magic. This means that standbys have to be updated before primaries; otherwise "PANIC: standby_redo: unknown op code 32" errors can be encountered. XXX: Reported-By: Васильев Дмитрий, Masahiko Sawada Discussion: CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com CAD21AoDpZ6Xjg=gFrGPnSn4oTRRcwK1EBrWCq9OqOHuAcMMC=w@mail.gmail.com
* Fix assorted defects in 09adc9a8c09c9640de05c7023b27fb83c761e91c.Robert Haas2016-04-21
| | | | | | | | | | | | | | That commit increased all shared memory allocations to the next higher multiple of PG_CACHE_LINE_SIZE, but it didn't ensure that allocation started on a cache line boundary. It also failed to remove a couple other pieces of now-useless code. BUFFERALIGN() is perhaps obsolete at this point, and likely should be removed at some point, too, but that seems like it can be left to a future cleanup. Mistakes all pointed out by Andres Freund. The patch is mine, with a few extra assertions which I adopted from his version of this fix.
* Inline initial comparisons in TestForOldSnapshot()Kevin Grittner2016-04-21
| | | | | | | | | | | | Even with old_snapshot_threshold = -1 (which disables the "snapshot too old" feature), performance regressions were seen at moderate to high concurrency. For example, a one-socket, four-core system running 200 connections at saturation could see up to a 2.3% regression, with larger regressions possible on NUMA machines. By inlining the early (smaller, faster) tests in the TestForOldSnapshot() function, the i7 case dropped to a 0.2% regression, which could easily just be noise, and is clearly an improvement. Further testing will show whether more is needed.
* Revert no-op changes to BufferGetPage()Kevin Grittner2016-04-20
| | | | | | | | | | | | | | | | | | The reverted changes were intended to force a choice of whether any newly-added BufferGetPage() calls needed to be accompanied by a test of the snapshot age, to support the "snapshot too old" feature. Such an accompanying test is needed in about 7% of the cases, where the page is being used as part of a scan rather than positioning for other purposes (such as DML or vacuuming). The additional effort required for back-patching, and the doubt whether the intended benefit would really be there, have indicated it is best just to rely on developers to do the right thing based on comments and existing usage, as we do with many other conventions. This change should have little or no effect on generated executable code. Motivated by the back-patching pain of Tom Lane and Robert Haas
* Make partition-lock-release coding more transparent in BufferAlloc().Tom Lane2016-04-18
| | | | | | | | | | | | | | | | | | | | | Coverity complained that oldPartitionLock was possibly dereferenced after having been set to NULL. That actually can't happen, because we'd only use it if (oldFlags & BM_TAG_VALID) is true. But nonetheless Coverity is justified in complaining, because at line 1275 we actually overwrite oldFlags, and then still expect its BM_TAG_VALID bit to be a safe guide to whether to release the oldPartitionLock. Thus, the code would be incorrect if someone else had changed the buffer's BM_TAG_VALID flag meanwhile. That should not happen, since we hold pin on the buffer throughout this sequence, but it's starting to look like a rather shaky chain of logic. And there's no need for such assumptions, because we can simply replace the (oldFlags & BM_TAG_VALID) tests with (oldPartitionLock != NULL), which has identical results and makes it plain to all comers that we don't dereference a null pointer. A small side benefit is that the range of liveness of oldFlags is greatly reduced, possibly allowing the compiler to save a register. This is just cleanup, not an actual bug fix, so there seems no need for a back-patch.
* Adjust spin.c's spinlock emulation so that 0 is not a valid spinlock value.Tom Lane2016-04-16
| | | | | | | | | | | | | | | We've had repeated troubles over the years with failures to initialize spinlocks correctly; see 6b93fcd14 for a recent example. Most of the time, on most platforms, such oversights can escape notice because all-zeroes is the expected initial content of an slock_t variable. The only platform we have where the initialized state of an slock_t isn't zeroes is HPPA, and that's practically gone in the wild. To make it easier to catch such errors without needing one of those, adjust the --disable-spinlocks code so that zero is not a valid value for an slock_t for it. In passing, remove a bunch of unnecessary #include's from spin.c; commit daa7527afc227443 removed all the intermodule coupling that made them necessary.
* Fix portability problem induced by commit a6f6b7819.Tom Lane2016-04-15
| | | | | | | | pg_xlogdump includes bufmgr.h. With a compiler that emits code for static inline functions even when they're unreferenced, that leads to unresolved external references in the new static-inline version of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done for similar issues elsewhere. Per buildfarm member pademelon.
* Make init_spin_delay() C89 compliant #2.Andres Freund2016-04-14
| | | | | | | | | | | My previous attempt at doing so, in 80abbeba23, was not sufficient. While that fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant expressions in the struct initializer, because the file/line/function information comes from the caller of s_lock(). Give up on using a macro, and use a static inline instead. Discussion: 4369.1460435533@sss.pgh.pa.us
* Make init_spin_delay() C89 compliant and change stuck spinlock reporting.Andres Freund2016-04-13
| | | | | | | | | | | | | | | | | | | | | | The current definition of init_spin_delay (introduced recently in 48354581a) wasn't C89 compliant. It's not legal to refer to refer to non-constant expressions, and the ptr argument was one. This, as reported by Tom, lead to a failure on buildfarm animal pademelon. The pointer, especially on system systems with ASLR, isn't super helpful anyway, though. So instead of making init_spin_delay into an inline function, make s_lock_stuck() report the function name in addition to file:line and change init_spin_delay() accordingly. While not a direct replacement, the function name is likely more useful anyway (line numbers are often hard to interpret in third party reports). This also fixes what file/line number is reported for waits via s_lock(). As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h. Reported-By: Tom Lane Discussion: 4369.1460435533@sss.pgh.pa.us
* Avoid atomic operation in MarkLocalBufferDirty().Andres Freund2016-04-13
| | | | | | | | | | | | | | | | | | | | | | The recent patch to make Pin/UnpinBuffer lockfree in the hot path (48354581a), accidentally used pg_atomic_fetch_or_u32() in MarkLocalBufferDirty(). Other code operating on local buffers was careful to only use pg_atomic_read/write_u32 which just read/write from memory; to avoid unnecessary overhead. On its own that'd just make MarkLocalBufferDirty() slightly less efficient, but in addition InitLocalBuffers() doesn't call pg_atomic_init_u32() - thus the spinlock fallback for the atomic operations isn't initialized. That in turn caused, as reported by Tom, buildfarm animal gaur to fail. As those errors are actually useful against this type of error, continue to omit - intentionally this time - initialization of the atomic variable. In addition, add an explicit note about only using pg_atomic_read/write on local buffers's state to BufferDesc's description. Reported-By: Tom Lane Discussion: 1881.1460431476@sss.pgh.pa.us
* Widen amount-to-flush arguments of FileWriteback and callers.Tom Lane2016-04-13
| | | | | | It's silly to define these counts as narrower than they might someday need to be. Also, I believe that the BLCKSZ * nflush calculation in mdwriteback was capable of overflowing an int.
* Fix assorted portability issues with using msync() for data flushing.Tom Lane2016-04-13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 428b1d6b29ca599c5700d4bc4f4ce4c5880369bf introduced the use of msync() for flushing dirty data from the kernel's file buffers. Several portability issues were overlooked, though: * Not all implementations of mmap() think that nbytes == 0 means "map the whole file". To fix, use lseek() to find out the true length. Fix callers of pg_flush_data to be aware that nbytes == 0 may result in trashing the file's seek position. * Not all implementations of mmap() will accept partial-page mmap requests. To fix, round down the length request to whatever sysconf() says the page size is. (I think this is OK from a portability standpoint, because sysconf() is required by SUS v2, and we aren't trying to compile this part on Windows anyway. Buildfarm should let us know if not.) * On 32-bit machines, the file size might exceed the available free address space, or even exceed what will fit in size_t. Check for the latter explicitly to avoid passing a false request size to mmap(). If mmap fails, silently fall through to the next implementation method, rather than bleating to the postmaster log and giving up. * mmap'ing directories fails on some platforms, and even if it works, msync'ing the directory is quite unlikely to help, as for that matter are the other flush implementations. In pre_sync_fname(), just skip flush attempts on directories. In passing, copy-edit the comments a bit. Stas Kelvich and myself
* Avoid extra locks in GetSnapshotData if old_snapshot_threshold < 0Kevin Grittner2016-04-12
| | | | | | | | | | | | On a big NUMA machine with 1000 connections in saturation load there was a performance regression due to spinlock contention, for acquiring values which were never used. Just fill with dummy values if we're not going to use them. This patch has not been benchmarked yet on a big NUMA machine, but it seems like a good idea on general principle, and it seemed to prevent an apparent 2.2% regression on a single-socket i7 box running 200 connections at saturation load.
* Use static inline function for BufferGetPage()Kevin Grittner2016-04-11
| | | | | | | | | | | | I was initially concerned that the some of the hundreds of references to BufferGetPage() where the literal BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a macro, leading to some hard-to-find performance regressions in corner cases. Inspection of disassembled code has shown identical code at all inspected locations, and the size difference doesn't amount to even one byte per such call. So make it readable. Per gripes from Álvaro Herrera and Tom Lane
* Avoid the use of a separate spinlock to protect a LWLock's wait queue.Andres Freund2016-04-10
| | | | | | | | | | | | | | | | Previously we used a spinlock, in adition to the atomically manipulated ->state field, to protect the wait queue. But it's pretty simple to instead perform the locking using a flag in state. Due to 6150a1b0 BufferDescs, on platforms (like PPC) with > 1 byte spinlocks, increased their size above 64byte. As 64 bytes are the size we pad allocated BufferDescs to, this can increase false sharing; causing performance problems in turn. Together with the previous commit this reduces the size to <= 64 bytes on all common platforms. Author: Andres Freund Discussion: CAA4eK1+ZeB8PMwwktf+3bRS0Pt4Ux6Rs6Aom0uip8c6shJWmyg@mail.gmail.com 20160327121858.zrmrjegmji2ymnvr@alap3.anarazel.de
* Allow Pin/UnpinBuffer to operate in a lockfree manner.Andres Freund2016-04-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pinning/Unpinning a buffer is a very frequent operation; especially in read-mostly cache resident workloads. Benchmarking shows that in various scenarios the spinlock protecting a buffer header's state becomes a significant bottleneck. The problem can be reproduced with pgbench -S on larger machines, but can be considerably worse for queries which touch the same buffers over and over at a high frequency (e.g. nested loops over a small inner table). To allow atomic operations to be used, cram BufferDesc's flags, usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable; that allows to manipulate them together using 32bit compare-and-swap operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could be lifted by using a 64bit field, but it's not a realistic configuration atm). As not all operations can easily implemented in a lockfree manner, implement the previous buf_hdr_lock via a flag bit in the atomic variable. That way we can continue to lock the header in places where it's needed, but can get away without acquiring it in the more frequent hot-paths. There's some additional operations which can be done without the lock, but aren't in this patch; but the most important places are covered. As bufmgr.c now essentially re-implements spinlocks, abstract the delay logic from s_lock.c into something more generic. It now has already two users, and more are coming up; there's a follupw patch for lwlock.c at least. This patch is based on a proof-of-concept written by me, which Alexander Korotkov made into a fully working patch; the committed version is again revised by me. Benchmarking and testing has, amongst others, been provided by Dilip Kumar, Alexander Korotkov, Robert Haas. On a large x86 system improvements for readonly pgbench, with a high client count, of a factor of 8 have been observed. Author: Alexander Korotkov and Andres Freund Discussion: 2400449.GjM57CE0Yg@dinodell
* Add the "snapshot too old" featureKevin Grittner2016-04-08
| | | | | | | | | | | | | | | | This feature is controlled by a new old_snapshot_threshold GUC. A value of -1 disables the feature, and that is the default. The value of 0 is just intended for testing. Above that it is the number of minutes a snapshot can reach before pruning and vacuum are allowed to remove dead tuples which the snapshot would otherwise protect. The xmin associated with a transaction ID does still protect dead tuples. A connection which is using an "old" snapshot does not get an error unless it accesses a page modified recently enough that it might not be able to produce accurate results. This is similar to the Oracle feature, and we use the same SQLSTATE and error message for compatibility.
* Modify BufferGetPage() to prepare for "snapshot too old" featureKevin Grittner2016-04-08
| | | | | | | | | | | This patch is a no-op patch which is intended to reduce the chances of failures of omission once the functional part of the "snapshot too old" patch goes in. It adds parameters for snapshot, relation, and an enum to specify whether the snapshot age check needs to be done for the page at this point. This initial patch passes NULL for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the third. The follow-on patch will change the places where the test needs to be made.
* Extend relations multiple blocks at a time to improve scalability.Robert Haas2016-04-08
| | | | | | | | | Contention on the relation extension lock can become quite fierce when multiple processes are inserting data into the same relation at the same time at a high rate. Experimentation shows the extending the relation multiple blocks at a time improves scalability. Dilip Kumar, reviewed by Petr Jelinek, Amit Kapila, and me.
* Revert bf08f2292ffca14fd133aa0901d1563b6ecd6894Simon Riggs2016-04-06
| | | | Remove recent changes to logging XLOG_RUNNING_XACTS by request.
* Align all shared memory allocations to cache line boundaries.Robert Haas2016-04-05
| | | | | | | | Experimentation shows this only costs about 6kB, which seems well worth it given the major performance effects that can be caused by insufficient alignment, especially on larger systems. Discussion: 14166.1458924422@sss.pgh.pa.us
* Avoid archiving XLOG_RUNNING_XACTS on idle serverSimon Riggs2016-04-04
| | | | | | | | | | If archive_timeout > 0 we should avoid logging XLOG_RUNNING_XACTS if idle. Bug 13685 reported by Laurence Rowe, investigated in detail by Michael Paquier, though this is not his proposed fix. 20151016203031.3019.72930@wrigleys.postgresql.org Simple non-invasive patch to allow later backpatch to 9.4 and 9.5
* Make all the declarations of WaitEventSetWaitBlock be marked "inline".Tom Lane2016-04-02
| | | | | The inconsistency here triggered compiler warnings on some buildfarm members, and it's surely pretty pointless.
* Copyedit comments and documentation.Noah Misch2016-04-01
|
* Fix typo in comment.Robert Haas2016-03-28
| | | | Thomas Munro
* Fix LWLockReportWaitEnd() parameter list to be (void).Andres Freund2016-03-27
| | | | Previously it was an "old style" function declaration.
* Don't use !! but != 0/NULL to force boolean evaluation.Andres Freund2016-03-27
| | | | | | | I introduced several uses of !! to force bit arithmetic to be boolean, but per discussion the project prefers != 0/NULL. Discussion: CA+TgmoZP5KakLGP6B4vUjgMBUW0woq_dJYi0paOz-My0Hwt_vQ@mail.gmail.com
* Introduce WaitEventSet API.Andres Freund2016-03-21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit ac1d794 ("Make idle backends exit if the postmaster dies.") introduced a regression on, at least, large linux systems. Constantly adding the same postmaster_alive_fds to the OSs internal datastructures for implementing poll/select can cause significant contention; leading to a performance regression of nearly 3x in one example. This can be avoided by using e.g. linux' epoll, which avoids having to add/remove file descriptors to the wait datastructures at a high rate. Unfortunately the current latch interface makes it hard to allocate any persistent per-backend resources. Replace, with a backward compatibility layer, WaitLatchOrSocket with a new WaitEventSet API. Users can allocate such a Set across multiple calls, and add more than one file-descriptor to wait on. The latter has been added because there's upcoming postgres features where that will be helpful. In addition to the previously existing poll(2), select(2), WaitForMultipleObjects() implementations also provide an epoll_wait(2) based implementation to address the aforementioned performance problem. Epoll is only available on linux, but that is the most likely OS for machines large enough (four sockets) to reproduce the problem. To actually address the aforementioned regression, create and use a long-lived WaitEventSet for FE/BE communication. There are additional places that would benefit from a long-lived set, but that's a task for another day. Thanks to Amit Kapila, who helped make the windows code I blindly wrote actually work. Reported-By: Dmitry Vasilyev Discussion: CAB-SwXZh44_2ybvS5Z67p_CDz=XFn4hNAD=CnMEF+QqkXwFrGg@mail.gmail.com 20160114143931.GG10941@awork2.anarazel.de
* Combine win32 and unix latch implementations.Andres Freund2016-03-21
| | | | | | | | | | | | | Previously latches for windows and unix had been implemented in different files. A later patch introduce an expanded wait infrastructure, keeping the implementation separate would introduce too much duplication. This basically just moves the functions, without too much change. The reason to keep this separate is that it allows blame to continue working a little less badly; and to make review a tiny bit easier. Discussion: 20160114143931.GG10941@awork2.anarazel.de
* Add idle_in_transaction_session_timeout.Robert Haas2016-03-16
| | | | | Vik Fearing, reviewed by Stéphane Schildknecht and me, and revised slightly by me.
* Fix typos.Robert Haas2016-03-15
| | | | Oskari Saarenmaa
* Include portability/mem.h into fd.c for MAP_FAILED.Andres Freund2016-03-12
| | | | | | Buildfarm members gaur and pademelon are old enough not to know about MAP_FAILED; which is used in 428b1d6. Include portability/mem.h to fix; as already done in a bunch of other places.
* Fix a typo, and remove unnecessary pgstat_report_wait_end().Robert Haas2016-03-11
| | | | Per Amit Kapila.
* Simplify GetLockNameFromTagType.Robert Haas2016-03-10
| | | | | | The old code is wrong, because it returns a pointer to an automatic variable. And it's also more clever than we really need to be considering that the case it's worrying about should never happen.
* Blindly try to fix dtrace enabled builds, broken in 9cd00c45.Andres Freund2016-03-10
| | | | | Reported-By: Peter Eisentraut Discussion: 56E2239E.1050607@gmx.net
* Checkpoint sorting and balancing.Andres Freund2016-03-10
| | | | | | | | | | | | | | | | | | | | | | | | Up to now checkpoints were written in the order they're in the BufferDescriptors. That's nearly random in a lot of cases, which performs badly on rotating media, but even on SSDs it causes slowdowns. To avoid that, sort checkpoints before writing them out. We currently sort by tablespace, relfilenode, fork and block number. One of the major reasons that previously wasn't done, was fear of imbalance between tablespaces. To address that balance writes between tablespaces. The other prime concern was that the relatively large allocation to sort the buffers in might fail, preventing checkpoints from happening. Thus pre-allocate the required memory in shared memory, at server startup. This particularly makes it more efficient to have checkpoint flushing enabled, because that'll often result in a lot of writes that can be coalesced into one flush. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund
* Allow to trigger kernel writeback after a configurable number of writes.Andres Freund2016-03-10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently writes to the main data files of postgres all go through the OS page cache. This means that some operating systems can end up collecting a large number of dirty buffers in their respective page caches. When these dirty buffers are flushed to storage rapidly, be it because of fsync(), timeouts, or dirty ratios, latency for other reads and writes can increase massively. This is the primary reason for regular massive stalls observed in real world scenarios and artificial benchmarks; on rotating disks stalls on the order of hundreds of seconds have been observed. On linux it is possible to control this by reducing the global dirty limits significantly, reducing the above problem. But global configuration is rather problematic because it'll affect other applications; also PostgreSQL itself doesn't always generally want this behavior, e.g. for temporary files it's undesirable. Several operating systems allow some control over the kernel page cache. Linux has sync_file_range(2), several posix systems have msync(2) and posix_fadvise(2). sync_file_range(2) is preferable because it requires no special setup, whereas msync() requires the to-be-flushed range to be mmap'ed. For the purpose of flushing dirty data posix_fadvise(2) is the worst alternative, as flushing dirty data is just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages from the page cache. Thus the feature is enabled by default only on linux, but can be enabled on all systems that have any of the above APIs. While desirable and likely possible this patch does not contain an implementation for windows. With the infrastructure added, writes made via checkpointer, bgwriter and normal user backends can be flushed after a configurable number of writes. Each of these sources of writes controlled by a separate GUC, checkpointer_flush_after, bgwriter_flush_after and backend_flush_after respectively; they're separate because the number of flushes that are good are separate, and because the performance considerations of controlled flushing for each of these are different. A later patch will add checkpoint sorting - after that flushes from the ckeckpoint will almost always be desirable. Bgwriter flushes are most of the time going to be random, which are slow on lots of storage hardware. Flushing in backends works well if the storage and bgwriter can keep up, but if not it can have negative consequences. This patch is likely to have negative performance consequences without checkpoint sorting, but unfortunately so has sorting without flush control. Discussion: alpine.DEB.2.10.1506011320000.28433@sto Author: Fabien Coelho and Andres Freund
* Rework wait for AccessExclusiveLocks on Hot StandbySimon Riggs2016-03-10
| | | | | | | | Earlier version committed in 9.0 caused spurious waits in some cases. New infrastructure for lock waits in 9.3 used to correct and improve this. Jeff Janes based upon a proposal by Simon Riggs, who also reviewed Additional review comments from Amit Kapila
* Provide much better wait information in pg_stat_activity.Robert Haas2016-03-10
| | | | | | | | | | | | When a process is waiting for a heavyweight lock, we will now indicate the type of heavyweight lock for which it is waiting. Also, you can now see when a process is waiting for a lightweight lock - in which case we will indicate the individual lock name or the tranche, as appropriate - or for a buffer pin. Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful discussion and suggestions by many others, including Alexander Korotkov, Vladimir Borodin, and many others.
* Introduce durable_rename() and durable_link_or_rename().Andres Freund2016-03-09
| | | | | | | | | | | | | | | | | | | | | | | | | Renaming a file using rename(2) is not guaranteed to be durable in face of crashes; especially on filesystems like xfs and ext4 when mounted with data=writeback. To be certain that a rename() atomically replaces the previous file contents in the face of crashes and different filesystems, one has to fsync the old filename, rename the file, fsync the new filename, fsync the containing directory. This sequence is not generally adhered to currently; which exposes us to data loss risks. To avoid having to repeat this arduous sequence, introduce durable_rename(), which wraps all that. Also add durable_link_or_rename(). Several places use link() (with a fallback to rename()) to rename a file, trying to avoid replacing the target file out of paranoia. Some of those rename sequences need to be durable as well. There seems little reason extend several copies of the same logic, so centralize the link() callers. This commit does not yet make use of the new functions; they're used in a followup commit. Author: Michael Paquier, Andres Freund Discussion: 56583BDD.9060302@2ndquadrant.com Backpatch: All supported branches
* Add some functions to fd.c for the convenience of extensions.Robert Haas2016-03-08
| | | | | | | | For example, if you want to perform an ioctl() on a file descriptor opened through the fd.c routines, there's no way to do that without being able to get at the underlying fd. KaiGai Kohei
* Change the format of the VM fork to add a second bit per page.Robert Haas2016-03-01
| | | | | | | | | | | | | | | | | | | | | | | The new bit indicates whether every tuple on the page is already frozen. It is cleared only when the all-visible bit is cleared, and it can be set only when we vacuum a page and find that every tuple on that page is both visible to every transaction and in no need of any future vacuuming. A future commit will use this new bit to optimize away full-table scans that would otherwise be triggered by XID wraparound considerations. A page which is merely all-visible must still be scanned in that case, but a page which is all-frozen need not be. This commit does not attempt that optimization, although that optimization is the goal here. It seems better to get the basic infrastructure in place first. Per discussion, it's very desirable for pg_upgrade to automatically migrate existing VM forks from the old format to the new format. That, too, will be handled in a follow-on patch. Masahiko Sawada, reviewed by Kyotaro Horiguchi, Fujii Masao, Amit Kapila, Simon Riggs, Andres Freund, and others, and substantially revised by me.