aboutsummaryrefslogtreecommitdiff
path: root/src/backend/storage/buffer/bufmgr.c
Commit message (Collapse)AuthorAge
* Update copyrights for 2013Bruce Momjian2013-01-01
| | | | | Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.
* Use correct text domain for translating errcontext() messages.Heikki Linnakangas2012-11-12
| | | | | | | | | | | | | | | | | | | errcontext() is typically used in an error context callback function, not within an ereport() invocation like e.g errmsg and errdetail are. That means that the message domain that the TEXTDOMAIN magic in ereport() determines is not the right one for the errcontext() calls. The message domain needs to be determined by the C file containing the errcontext() call, not the file containing the ereport() call. Fix by turning errcontext() into a macro that passes the TEXTDOMAIN to use for the errcontext message. "errcontext" was used in a few places as a variable or struct field name, I had to rename those out of the way, now that errcontext is a macro. We've had this problem all along, but this isn't doesn't seem worth backporting. It's a fairly minor issue, and turning errcontext from a function to a macro requires at least a recompile of any external code that calls errcontext().
* Revert "Use "transient" files for blind writes, take 2".Tom Lane2012-10-17
| | | | | | | | | | This reverts commit fba105b1099f4f5fa7283bb17cba6fed2baa8d0c. That approach had problems with the smgr-level state not tracking what we really want to happen, and with the VFD-level state not tracking the smgr-level state very well either. In consequence, it was still possible to hold kernel file descriptors open for long-gone tables (as in recent report from Tore Halset), and yet there were also cases of FDs being closed undesirably soon. A replacement implementation will follow.
* Fix bufmgr so CHECKPOINT_END_OF_RECOVERY behaves as a shutdown checkpoint.Simon Riggs2012-09-16
| | | | | | | | | Recovery code documents clearly that a shutdown checkpoint is executed at end of recovery - a shutdown checkpoint WAL record is written but the buffer manager had been altered to treat end of recovery as a normal checkpoint. This bug exacerbates the bufmgr relpersistence bug. Bug spotted by Andres Freund, patch by me.
* Properly set relpersistence for fake relcache entries.Robert Haas2012-09-14
| | | | | | | This can result in buffers failing to be properly flushed at checkpoint time, leading to data loss. Report, diagnosis, and patch by Jeff Davis.
* Split resowner.hAlvaro Herrera2012-08-28
| | | | | This lets files that are mere users of ResourceOwner not automatically include the headers for stuff that is managed by the resowner mechanism.
* Improve coding around the fsync request queue.Tom Lane2012-07-17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In all branches back to 8.3, this patch fixes a questionable assumption in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue that there are no uninitialized pad bytes in the request queue structs. This would only cause trouble if (a) there were such pad bytes, which could happen in 8.4 and up if the compiler makes enum ForkNumber narrower than 32 bits, but otherwise would require not-currently-planned changes in the widths of other typedefs; and (b) the kernel has not uniformly initialized the contents of shared memory to zeroes. Still, it seems a tad risky, and we can easily remove any risk by pre-zeroing the request array for ourselves. In addition to that, we need to establish a coding rule that struct RelFileNode can't contain any padding bytes, since such structs are copied into the request array verbatim. (There are other places that are assuming this anyway, it turns out.) In 9.1 and up, the risk was a bit larger because we were also effectively assuming that struct RelFileNodeBackend contained no pad bytes, and with fields of different types in there, that would be much easier to break. However, there is no good reason to ever transmit fsync or delete requests for temp files to the bgwriter/checkpointer, so we can revert the request structs to plain RelFileNode, getting rid of the padding risk and saving some marginal number of bytes and cycles in fsync queue manipulation while we are at it. The savings might be more than marginal during deletion of a temp relation, because the old code transmitted an entirely useless but nonetheless expensive-to-process ForgetRelationFsync request to the background process, and also had the background process perform the file deletion even though that can safely be done immediately. In addition, make some cleanup of nearby comments and small improvements to the code in CompactCheckpointerRequestQueue/CompactBgwriterRequestQueue.
* Run pgindent on 9.2 source tree in preparation for first 9.3Bruce Momjian2012-06-10
| | | | commit-fest.
* Scan the buffer pool just once, not once per fork, during relation drop.Tom Lane2012-06-07
| | | | | | | | This provides a speedup of about 4X when NBuffers is large enough. There is also a useful reduction in sinval traffic, since we only do CacheInvalidateSmgr() once not once per fork. Simon Riggs, reviewed and somewhat revised by Tom Lane
* Do unlocked prechecks in bufmgr.c loops that scan the whole buffer pool.Tom Lane2012-06-07
| | | | | | | | | | | | | | | | | | | | DropRelFileNodeBuffers, DropDatabaseBuffers, FlushRelationBuffers, and FlushDatabaseBuffers have to scan the whole shared_buffers pool because we have no index structure that would find the target buffers any more efficiently than that. This gets expensive with large NBuffers. We can shave some cycles from these loops by prechecking to see if the current buffer is interesting before we acquire the buffer header lock. Ordinarily such a test would be unsafe, but in these cases it should be safe because we are already assuming that the caller holds a lock that prevents any new target pages from being loaded into the buffer pool concurrently. Therefore, no buffer tag should be changing to a value of interest, only away from a value of interest. So a false negative match is impossible, while a false positive is safe because we'll recheck after acquiring the buffer lock. Initial testing says that this speeds these loops by a factor of 2X to 3X on common Intel hardware. Patch for DropRelFileNodeBuffers by Jeff Janes (based on an idea of Heikki's); extended to the remaining sequential scans by Tom Lane
* Improve control logic for bgwriter hibernation mode.Tom Lane2012-05-09
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6d90eaaa89a007e0d365f49d6436f35d2392cfeb added a hibernation mode to the bgwriter to reduce the server's idle-power consumption. However, its interaction with the detailed behavior of BgBufferSync's feedback control loop wasn't very well thought out. That control loop depends primarily on the rate of buffer allocation, not the rate of buffer dirtying, so the hibernation mode has to be designed to operate only when no new buffer allocations are happening. Also, the check for whether the system is effectively idle was not quite right and would fail to detect a constant low level of activity, thus allowing the bgwriter to go into hibernation mode in a way that would let the cycle time vary quite a bit, possibly further confusing the feedback loop. To fix, move the wakeup support from MarkBufferDirty and SetBufferCommitInfoNeedsSave into StrategyGetBuffer, and prevent the bgwriter from entering hibernation mode unless no buffer allocations have happened recently. In addition, fix the delaying logic to remove the problem of possibly not responding to signals promptly, which was basically caused by trying to use the process latch's is_set flag for multiple purposes. I can't prove it but I'm suspicious that that hack was responsible for the intermittent "postmaster does not shut down" failures we've been seeing in the buildfarm lately. In any case it did nothing to improve the readability or robustness of the code. In passing, express the hibernation sleep time as a multiplier on BgWriterDelay, not a constant. I'm not sure whether there's any value in exposing the longer sleep time as an independently configurable setting, but we can at least make it act like this for little extra code.
* Rename I/O timing statistics columns to blk_read_time and blk_write_time.Tom Lane2012-04-29
| | | | | This seems more consistent with the pre-existing choices for names of other statistics columns. Rename assorted internal identifiers to match.
* Rename track_iotiming GUC to track_io_timing.Tom Lane2012-04-29
| | | | This spelling seems significantly more readable to me.
* Fix incorrect comment in SetBufferCommitInfoNeedsSave().Robert Haas2012-04-18
| | | | | Noah Misch spotted the fact that the old comment is in fact incorrect, due to memory ordering hazards.
* Expose track_iotiming data via the statistics collector.Robert Haas2012-04-05
| | | | | | Ants Aasma's original patch to add timing information for buffer I/O requests exposed this data at the relation level, which was judged too costly. I've here exposed it at the database level instead.
* New GUC, track_iotiming, to track I/O timings.Robert Haas2012-03-27
| | | | | | | | Currently, the only way to see the numbers this gathers is via EXPLAIN (ANALYZE, BUFFERS), but the plan is to add visibility through the stats collector and pg_stat_statements in subsequent patches. Ants Aasma, reviewed by Greg Smith, with some further changes by me.
* Make EXPLAIN (BUFFERS) track blocks dirtied, as well as those written.Robert Haas2012-02-22
| | | | | | Also expose the new counters through pg_stat_statements. Patch by me. Review by Fujii Masao and Greg Smith.
* Make bgwriter sleep longer when it has no work to do, to save electricity.Heikki Linnakangas2012-01-26
| | | | | | | | | To make it wake up promptly when activity starts again, backends nudge it by setting a latch in MarkBufferDirty(). The latch is kept set while bgwriter is active, so there is very little overhead from that when the system is busy. It is only armed before going into longer sleep. Peter Geoghegan, with some changes by me.
* Fix variable confusion in BufferSync().Robert Haas2012-01-06
| | | | | | | | | As noted by Heikki Linnakangas, the previous coding confused the "flags" variable with the "mask" variable. The affect of this appears to be that unlogged buffers would get written out at every checkpoint rather than only at shutdown time. Although that's arguably an acceptable failure mode, I'm back-patching this change, since it seems like a poor idea to rely on this happening to work.
* Update copyright notices for year 2012.Bruce Momjian2012-01-01
|
* Improve logging of autovacuum I/O activityAlvaro Herrera2011-11-25
| | | | | | | | | This adds some I/O stats to the logging of autovacuum (when the operation takes long enough that log_autovacuum_min_duration causes it to be logged), so that it is easier to tune. Notably, it adds buffer I/O counts (hits, misses, dirtied) and read and write rate. Authors: Greg Smith and Noah Misch
* Avoid floating-point underflow while tracking buffer allocation rate.Tom Lane2011-11-19
| | | | | | | | | | | | | | | | | When the system is idle for awhile after activity, the "smoothed_alloc" state variable in BgBufferSync converges slowly to zero. With standard IEEE float arithmetic this results in several iterations with denormalized values, which causes kernel traps and annoying log messages on some poorly-designed platforms. There's no real need to track such small values of smoothed_alloc, so we can prevent the kernel traps by forcing it to zero as soon as it's too small to be interesting for our purposes. This issue is purely cosmetic, since the iterations don't happen fast enough for the kernel traps to pose any meaningful performance problem, but still it seems worth shutting up the log messages. The kernel log messages were previously reported by a number of people, but kudos to Greg Matthews for tracking down exactly where they were coming from.
* Split work of bgwriter between 2 processes: bgwriter and checkpointer.Simon Riggs2011-11-01
| | | | | | | | | | | | | bgwriter is now a much less important process, responsible for page cleaning duties only. checkpointer is now responsible for checkpoints and so has a key role in shutdown. Later patches will correct doc references to the now old idea that bgwriter performs checkpoints. Has beneficial effect on performance at high write rates, but mainly refactoring to more easily allow changes for power reduction by simplifying previously tortuous code around required to allow page cleaning and checkpointing to time slice in the same process. Patch by me, Review by Dickson Guedes
* Allow hint bits to be set sooner for temporary and unlogged tables.Robert Haas2011-10-28
| | | | | | | | | | | We need not wait until the commit record is durably on disk, because in the event of a crash the page we're updating with hint bits will be gone anyway. Per off-list report from Heikki Linnakangas, this can significantly degrade the performance of unlogged tables; I was able to show a 2x speedup from this patch on a pgbench run with scale factor 15. In practice, this will mostly help small, heavily updated tables, because on larger tables you're unlikely to run into the same row again before the commit record makes it out to disk.
* Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h.Tom Lane2011-09-09
| | | | | | | | | | | As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.
* Fix incorrect initialization of ProcGlobal->startupBufferPinWaitBufId.Tom Lane2011-08-02
| | | | | | | It was initialized in the wrong place and to the wrong value. With bad luck this could result in incorrect query-cancellation failures in hot standby sessions, should a HS backend be holding pin on buffer number 1 while trying to acquire a lock.
* Unify spelling of "canceled", "canceling", "cancellation"Peter Eisentraut2011-06-29
| | | | | We had previously (af26857a2775e7ceb0916155e931008c2116632f) established the U.S. spellings as standard.
* Capitalization fixesPeter Eisentraut2011-06-19
|
* Use "transient" files for blind writes, take 2Alvaro Herrera2011-06-10
| | | | | | | | | | | | | | | | | | | | | "Blind writes" are a mechanism to push buffers down to disk when evicting them; since they may belong to different databases than the one a backend is connected to, the backend does not necessarily have a relation to link them to, and thus no way to blow them away. We were keeping those files open indefinitely, which would cause a problem if the underlying table was deleted, because the operating system would not be able to reclaim the disk space used by those files. To fix, have bufmgr mark such files as transient to smgr; the lower layer is allowed to close the file descriptor when the current transaction ends. We must be careful to have any other access of the file to remove the transient markings, to prevent unnecessary expensive system calls when evicting buffers belonging to our own database (which files we're likely to require again soon.) This commit fixes a bug in the previous one, which neglected to cleanly handle the LRU ring that fd.c uses to manage open files, and caused an unacceptable failure just before beta2 and was thus reverted.
* Revert "Use "transient" files for blind writes"Alvaro Herrera2011-06-09
| | | | | | This reverts commit 54d9e8c6c19cbefa8fb42ed3442a0a5327590ed3, which caused a failure on the buildfarm. Not a good thing to have just before a beta release.
* Use "transient" files for blind writesAlvaro Herrera2011-06-09
| | | | | | | | | | | | | | | | | "Blind writes" are a mechanism to push buffers down to disk when evicting them; since they may belong to different databases than the one a backend is connected to, the backend does not necessarily have a relation to link them to, and thus no way to blow them away. We were keeping those files open indefinitely, which would cause a problem if the underlying table was deleted, because the operating system would not be able to reclaim the disk space used by those files. To fix, have bufmgr mark such files as transient to smgr; the lower layer is allowed to close the file descriptor when the current transaction ends. We must be careful to have any other access of the file to remove the transient markings, to prevent unnecessary expensive system calls when evicting buffers belonging to our own database (which files we're likely to require again soon.)
* pgindent run before PG 9.1 beta 1.Bruce Momjian2011-04-10
|
* Stamp copyrights for year 2011.Bruce Momjian2011-01-01
|
* Support unlogged tables.Robert Haas2010-12-29
| | | | | | | The contents of an unlogged table are WAL-logged; thus, they are not available on standby servers and are truncated whenever the database system enters recovery. Indexes on unlogged tables are also unlogged. Unlogged GiST indexes are not currently supported.
* Generalize concept of temporary relations to "relation persistence".Robert Haas2010-12-13
| | | | | | | | | | | | | | | This commit replaces pg_class.relistemp with pg_class.relpersistence; and also modifies the RangeVar node type to carry relpersistence rather than istemp. It also removes removes rd_istemp from RelationData and instead performs the correct computation based on relpersistence. For clarity, we add three new macros: RelationNeedsWAL(), RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we can clarify the purpose of each check that previous depended on rd_istemp. This is intended as infrastructure for the upcoming unlogged tables patch, as well as for future possible work on global temporary tables.
* Remove belt-and-suspenders guards against buffer pin leaks.Robert Haas2010-11-25
| | | | | | | | Forcibly releasing all leftover buffer pins should be unnecessary now that we have a robust ResourceOwner mechanism, and it significantly increases the cost of process shutdown. Instead, in an assert-enabled build, assert that no pins are held; in a non-assert-enabled build, do nothing.
* Remove cvs keywords from all files.Magnus Hagander2010-09-20
|
* Remove the isLocalBuf argument from ReadBuffer_common.Robert Haas2010-08-20
| | | | | | Since an SMgrRelation now knows whether or not the underlying relation is temporary, there's no point in also passing that information via an additional argument.
* Further dtrace adjustments for the backend-IDs-in-relpath patch.Robert Haas2010-08-14
| | | | | Update the documentation, and back out a few ill-considered changes whose folly I failed to realize for failure to read the documentation.
* Fix assorted dtrace breakage caused by patch to include backend IDsRobert Haas2010-08-13
| | | | | | in temp relpaths. Per buildfarm.
* Include the backend ID in the relpath of temporary relations.Robert Haas2010-08-13
| | | | | | | | | | | | | | | | | This allows us to reliably remove all leftover temporary relation files on cluster startup without reference to system catalogs or WAL; therefore, we no longer include temporary relations in XLOG_XACT_COMMIT and XLOG_XACT_ABORT WAL records. Since these changes require including a backend ID in each SharedInvalSmgrMsg, the size of the SharedInvalidationMessage.id field has been reduced from two bytes to one, and the maximum number of connections has been reduced from INT_MAX / 4 to 2^23-1. It would be possible to remove these restrictions by increasing the size of SharedInvalidationMessage by 4 bytes, but right now that doesn't seem like a good trade-off. Review by Jaime Casanova and Tom Lane.
* pgindent run for 9.0Bruce Momjian2010-02-26
|
* In HS, Startup process sets SIGALRM when waiting for buffer pin. IfSimon Riggs2010-01-23
| | | | | | | woken by alarm we send SIGUSR1 to all backends requesting that they check to see if they are blocking Startup process. If so, they throw ERROR/FATAL as for other conflict resolutions. Deadlock stop gap removed. max_standby_delay = -1 option removed to prevent deadlock.
* Update copyright for the year 2010.Bruce Momjian2010-01-02
|
* Add an EXPLAIN (BUFFERS) option to show buffer-usage statistics.Robert Haas2009-12-15
| | | | | | | | This patch also removes buffer-usage statistics from the track_counts output, since this (or the global server statistics) is deemed to be a better interface to this information. Itagaki Takahiro, reviewed by Euler Taveira de Oliveira.
* 8.4 pgindent run, with new combined Linux/FreeBSD/MinGW typedef listBruce Momjian2009-06-11
| | | | provided by Andrew.
* Add a comment documenting the question of whether PrefetchBuffer shouldTom Lane2009-04-03
| | | | | | | try to protect an already-existing buffer from being evicted. This was left as an open issue when the posix_fadvise patch was committed. I'm not sure there's any evidence to justify more work in this area, but we should have some record about it in the source code.
* Modify the relcache to record the temp status of both local and nonlocalTom Lane2009-03-31
| | | | | | | | | | temp relations; this is no more expensive than before, now that we have pg_class.relistemp. Insert tests into bufmgr.c to prevent attempting to fetch pages from nonlocal temp relations. This provides a low-level defense against bugs-of-omission allowing temp pages to be loaded into shared buffers, as in the contrib/pgstattuple problem reported by Stuart Bishop. While at it, tweak a bunch of places to use new relcache tests (instead of expensive probes into pg_namespace) to detect local or nonlocal temp tables.
* More fixes for 8.4 DTrace probes. Remove useless BUFFER_HIT/BUFFER_MISSTom Lane2009-03-23
| | | | | | | probes --- the BUFFER_READ_DONE probe provides the same information and more besides. Expand the LOCK_WAIT_START/DONE probe arguments so that there's actually some chance of telling what is being waited for. Update and clean up the documentation.
* Add isExtend to the parameters of the buffer_read_start and buffer_read_doneTom Lane2009-03-22
| | | | | | | | | | | | | | | DTrace probes, so that ordinary reads can be distinguished from relation extension operations. Move buffer_read_start probe to before the smgrnblocks() call that's needed in the isExtend case, since really that step should be charged as part of the time needed for the extension operation. (This makes it slightly harder to match the read_start with the associated read_done, since now you can't match them on blockNumber, but it should still be possible since isExtend operations on the same relation can never be interleaved.) Per recent discussion. In passing, add the page identity (forkNum/blockNum) to the parameters of the buffer_flush_start/buffer_flush_done probes, which were unaccountably lacking the info.