path: root/src/backend/access
* Continue to allow VACUUM to mark last block of index dirty (Simon Riggs, 2011-11-22)
  even when there is no work to do. Further analysis required. Revert of patch c1458cc495ff800cd176a1c2e56d8b62680d9b71.
* Avoid marking buffer dirty when VACUUM has no work to do. (Simon Riggs, 2011-11-18)
  When wal_level = 'hot_standby' we touched the last page of the relation during a VACUUM, even if nothing else had happened. That would alter the LSN of the last block and set the mtime of the relation file unnecessarily. Noted by Thom Brown.
* Wake up WALWriter as needed for asynchronous commit performance. (Simon Riggs, 2011-11-13)
  Previously we waited for wal_writer_delay before flushing WAL. Now we also wake WALWriter as soon as a WAL buffer page has filled. Significant effect observed on performance of asynchronous commits by Robert Haas, attributed to the ability to set hint bits on tuples earlier and so reducing contention caused by clog lookups.
* Fix another bug in the redo of COPY batches. (Heikki Linnakangas, 2011-11-10)
  I got alignment wrong in the redo routine. Spotted by redoing the log generated by the COPY regression test.
* Fix bugs in the COPY heap-insert batching patch. (Heikki Linnakangas, 2011-11-09)
  Forgot to call RestoreBkpBlocks() in the redo-function, as pointed out by Simon Riggs. In redo of a regular heap insert, it's taken care of in heap_redo(), but this new record type uses the heap2 RM, and heap2_redo() does not take care of that for you. Also, failed to reset the vmbuffer and all_visible_cleared local variables after switching to a new buffer.
* In COPY, insert tuples to the heap in batches. (Heikki Linnakangas, 2011-11-09)
  This greatly reduces the WAL volume, especially when the table is narrow. The overhead of locking the heap page is also reduced. Reduced WAL traffic also makes it scale a lot better, if you run multiple COPY processes at the same time.
* Make VACUUM avoid waiting for a cleanup lock, where possible. (Robert Haas, 2011-11-07)
  In a regular VACUUM, it's OK to skip pages for which a cleanup lock isn't immediately available; the next VACUUM will deal with them. If we're scanning the entire relation to advance relfrozenxid, we might need to wait, but only if there are tuples on the page that actually require freezing. These changes should greatly reduce the incidence of vacuum processes getting "stuck".

  Simon Riggs and Robert Haas
* Don't assume that a tuple's header size is unchanged during toasting. (Tom Lane, 2011-11-04)
  This assumption can be wrong when the toaster is passed a raw on-disk tuple, because the tuple might pre-date an ALTER TABLE ADD COLUMN operation that added columns without rewriting the table. In such a case the tuple's natts value is smaller than what we expect from the tuple descriptor, and so its t_hoff value could be smaller too. In fact, the tuple might not have a null bitmap at all, and yet our current opinion of it is that it contains some trailing nulls.

  In such a situation, toast_insert_or_update did the wrong thing, because to save a few lines of code it would use the old t_hoff value as the offset where heap_fill_tuple should start filling data. This did not leave enough room for the new nulls bitmap, with the result that the first few bytes of data could be overwritten with null flag bits, as in a recent report from Hubert Depesz Lubaczewski.

  The particular case reported requires ALTER TABLE ADD COLUMN followed by CREATE TABLE AS SELECT * FROM ... or INSERT ... SELECT * FROM ..., and further requires that there be some out-of-line toasted fields in one of the tuples to be copied; else we'll not reach the troublesome code. The problem can only manifest in this form in 8.4 and later, because before commit a77eaa6a95009a3441e0d475d1980259d45da072, CREATE TABLE AS or INSERT/SELECT wouldn't result in raw disk tuples getting passed directly to heap_insert --- there would always have been at least a junkfilter in between, and that would reconstitute the tuple header with an up-to-date t_natts and hence t_hoff. But I'm backpatching the tuptoaster change all the way anyway, because I'm not convinced there are no older code paths that present a similar risk.
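  A sketch of the triggering sequence described above; the table and column names are hypothetical, not taken from the commit:

      CREATE TABLE t (id int, payload text);
      INSERT INTO t VALUES (1, repeat('x', 1000000));  -- value large enough to be stored out-of-line (toasted)
      ALTER TABLE t ADD COLUMN extra int;              -- adds a column without rewriting existing tuples
      CREATE TABLE t_copy AS SELECT * FROM t;          -- raw on-disk tuples are handed straight to the toaster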
* Move user functions related to WAL into xlogfuncs.c (Simon Riggs, 2011-11-04)
* Avoid scanning nulls at the beginning of a btree index scan. (Tom Lane, 2011-11-02)
  If we have an inequality key that constrains the other end of the index, it doesn't directly help us in doing the initial positioning ... but it does imply a NOT NULL constraint on the index column. If the index stores nulls at this end, we can use the implied NOT NULL condition for initial positioning, just as if it had been stated explicitly. This avoids wasting time when there are a lot of nulls in the column. This is the reverse of the examples given in bugs #6278 and #6283, which were about failing to stop early when we encounter nulls at the end of the indexscan.
* Fix btree stop-at-nulls logic properly. (Tom Lane, 2011-11-02)
  As pointed out by Naoya Anzai, my previous try at this was a few bricks shy of a load, because I had forgotten that the initial-positioning logic might not try to skip over nulls at the end of the index the scan will start from. We ought to fix that, because it represents an unnecessary inefficiency, but first let's get the scan-stop logic back to a safe state. With this patch, we preserve the performance benefit requested in bug #6278 for the case of scanning forward into NULLs (in a NULLS LAST index), but the reverse case of scanning backward across NULLs when there's no suitable initial-positioning qual is still inefficient.
* Update more comments about checkpoints being done by bgwriter (Simon Riggs, 2011-11-02)
* Reduce checkpoints and WAL traffic on low activity database server (Simon Riggs, 2011-11-02)
  Previously, we skipped a checkpoint if no WAL had been written since the last checkpoint, though this does not appear in user documentation. As of now, we skip a checkpoint until enough WAL has been written to switch to the next WAL file. This greatly reduces the level of activity and number of WAL messages generated by a very low activity server. This is safe because the purpose of a checkpoint is to act as a starting place for a recovery, in case of crash. This patch maintains minimal WAL volume for replay in case of crash, thus maintaining very low crash recovery time.
* Refactor xlog.c to create src/backend/postmaster/startup.c (Simon Riggs, 2011-11-02)
  Startup process now has its own dedicated file, just like all other special/background processes. Reduces role and size of xlog.c.
* Derive oldestActiveXid at correct time for Hot Standby. (Simon Riggs, 2011-11-02)
  There was a timing window between when oldestActiveXid was derived and when it should have been derived that only shows itself under heavy load. Move code around to ensure correct timing of derivation. No change to StartupSUBTRANS() code, which is where this failed. Bug report by Chris Redekop
* Fix timing of Startup CLOG and MultiXact during Hot Standby (Simon Riggs, 2011-11-02)
  Patch by me, bug report by Chris Redekop, analysis by Florian Pflug
* Fix race condition with toast table access from a stale syscache entry. (Tom Lane, 2011-11-01)
  If a tuple in a syscache contains an out-of-line toasted field, and we try to fetch that field shortly after some other transaction has committed an update or deletion of the tuple, there is a race condition: vacuum could come along and remove the toast tuples before we can fetch them. This leads to transient failures like "missing chunk number 0 for toast value NNNNN in pg_toast_2619", as seen in recent reports from Andrew Hammond and Tim Uckun.

  The design idea of syscache is that access to stale syscache entries should be prevented by relation-level locks, but that fails for at least two cases where toasted fields are possible: ANALYZE updates pg_statistic rows without locking out sessions that might want to plan queries on the same table, and CREATE OR REPLACE FUNCTION updates pg_proc rows without any meaningful lock at all.

  The least risky fix seems to be an idea that Heikki suggested when we were dealing with a related problem back in August: forcibly detoast any out-of-line fields before putting a tuple into syscache in the first place. This avoids the problem because at the time we fetch the parent tuple from the catalog, we should be holding an MVCC snapshot that will prevent removal of the toast tuples, even if the parent tuple is outdated immediately after we fetch it. (Note: I'm not convinced that this statement holds true at every instant where we could be fetching a syscache entry at all, but it does appear to hold true at the times where we could fetch an entry that could have a toasted field. We will need to be a bit wary of adding toast tables to low-level catalogs that don't have them already.) An additional benefit is that subsequent uses of the syscache entry should be faster, since they won't have to detoast the field.

  Back-patch to all supported versions. The problem is significantly harder to reproduce in pre-9.0 releases, because of their willingness to flush every entry in a syscache whenever the underlying catalog is vacuumed (cf CatalogCacheFlushRelation); but there is still a window for trouble.
* Comment changes to show bgwriter no longer performs checkpoints. (Simon Riggs, 2011-11-01)
* Stop btree indexscans upon reaching nulls in either direction. (Tom Lane, 2011-10-31)
  The existing scan-direction-sensitive tests were overly complex, and failed to stop the scan in cases where it's perfectly legitimate to do so. Per bug #6278 from Maksym Boguk. Back-patch to 8.3, which is as far back as the patch applies easily. Doesn't seem worth sweating over a relatively minor performance issue in 8.2 at this late date. (But note that this was a performance regression from 8.1 and before, so 8.2 is being left as an outlier.)
* Update visibilitymap.c header comments. (Robert Haas, 2011-10-29)
  Recent work on index-only scans left this somewhat out of date.
* Support synchronization of snapshots through an export/import procedure. (Tom Lane, 2011-10-22)
  A transaction can export a snapshot with pg_export_snapshot(), and then others can import it with SET TRANSACTION SNAPSHOT. The data does not leave the server so there are no security issues. A snapshot can only be imported while the exporting transaction is still running, and there are some other restrictions.

  I'm not totally convinced that we've covered all the bases for SSI (true serializable) mode, but it works fine for lesser isolation modes.

  Joachim Wieland, reviewed by Marko Tiikkaja, and rather heavily modified by Tom Lane
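  A minimal sketch of that workflow, run in two concurrent sessions; the table name and the exact snapshot identifier shown are illustrative, not from the commit:

      -- Session 1: export a snapshot while keeping the transaction open
      BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
      SELECT pg_export_snapshot();          -- returns an identifier, e.g. '000003A1-1'

      -- Session 2: adopt the same snapshot before running any query
      BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
      SET TRANSACTION SNAPSHOT '000003A1-1';
      SELECT count(*) FROM orders;          -- sees exactly the data visible to session 1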
* Suppress -Wunused-result warnings about write() and fwrite(). (Tom Lane, 2011-10-18)
  This is merely an exercise in satisfying pedants, not a bug fix, because in every case we were checking for failure later with ferror(), or else there was nothing useful to be done about a failure anyway. Document the latter cases.
* Avoid assuming that index-only scan data matches the index's rowtype. (Tom Lane, 2011-10-16)
  In general the data returned by an index-only scan should have the datatypes originally computed by FormIndexDatum. If the index opclasses use "storage" datatypes different from their input datatypes, the scan tuple will not have the same rowtype attributed to the index; but we had a hard-wired assumption that that was true in nodeIndexonlyscan.c. We'd already hacked around the issue for the one case where the types are different in btree indexes (btree name_ops), but this would definitely come back to bite us if we ever implement index-only scans in GiST.

  To fix, require the index AM to explicitly provide the tupdesc for the tuple it is returning. btree can just pass back the index's tupdesc, but GiST will have to work harder when and if it supports index-only scans.

  I had previously proposed fixing this by allowing the index AM to fill the scan tuple slot directly; but on reflection that seemed like a module layering violation, since TupleTableSlots are creatures of the executor. At least in the btree case, it would also be less efficient, since the tuple deconstruction work would occur even for rows later found to be invisible to the scan's snapshot.
* Teach btree to handle ScalarArrayOpExpr quals natively. (Tom Lane, 2011-10-16)
  This allows "indexedcol op ANY(ARRAY[...])" conditions to be used in plain indexscans, and particularly in index-only scans.
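  For illustration, a query whose qualification has that shape; the table and index names here are made up for the example:

      CREATE INDEX orders_status_idx ON orders (status);
      EXPLAIN SELECT *
      FROM orders
      WHERE status = ANY (ARRAY['shipped', 'pending', 'returned']);
      -- The ANY(ARRAY[...]) condition can now be evaluated inside the
      -- btree index scan itself instead of as a separate filter step.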
* Measure the number of all-visible pages for use in index-only scan costing. (Tom Lane, 2011-10-14)
  Add a column pg_class.relallvisible to remember the number of pages that were all-visible according to the visibility map as of the last VACUUM (or ANALYZE, or some other operations that update pg_class.relpages). Use relallvisible/relpages, instead of an arbitrary constant, to estimate how many heap page fetches can be avoided during an index-only scan.

  This is pretty primitive and will no doubt see refinements once we've acquired more field experience with the index-only scan mechanism, but it's way better than using a constant.

  Note: I had to adjust an underspecified query in the window.sql regression test, because it was changing answers when the plan changed to use an index-only scan. Some of the adjacent tests perhaps should be adjusted as well, but I didn't do that here.
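  The new statistic can be inspected alongside relpages; the table name below is illustrative:

      SELECT relname, relpages, relallvisible,
             round(relallvisible::numeric / NULLIF(relpages, 0), 2) AS all_visible_fraction
      FROM pg_class
      WHERE relname = 'orders';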
* Modify RelationGetBufferForTuple() to use a typedef, rather than a struct, to help pgindent. (Bruce Momjian, 2011-10-12)
* Add comment on why pulling data from a "name" index column can't crash. (Tom Lane, 2011-10-11)
  It's been bothering me for several days that pretending that the cstring data stored in a btree name_ops column is really a "name" Datum could lead to reading past the end of memory. However, given the current memory layout used for index-only scans in the btree code, a crash is in fact not possible. Document that so we don't break it. I have not thought of any other solutions that aren't fairly ugly too, and most of them lose the functionality of index-only scans on name columns altogether, so this seems like the way to go.
* Clean up a couple of box gist helper functions. (Heikki Linnakangas, 2011-10-09)
  The original idea of this patch was to make box picksplit run faster, by eliminating unnecessary palloc() overhead, but that was obsoleted by the new double-sorting split algorithm that doesn't call these functions so heavily anymore. Nevertheless, the code looks better this way. Original patch by me, reviewed and tidied up after the double-sorting patch by Kevin Grittner.
* Improve index-only scans to avoid repeated access to the index page. (Tom Lane, 2011-10-09)
  We copy all the matched tuples off the page during _bt_readpage, instead of expensively re-locking the page during each subsequent tuple fetch. This costs a bit more local storage, but not more than 2*BLCKSZ worth, and the reduction in LWLock traffic is certainly worth that. What's more, this lets us get rid of the API wart in the original patch that said an index AM could randomly decline to supply an index tuple despite having asserted pg_am.amcanreturn. That will be important for future improvements in the index-only-scan feature, since the executor will now be able to rely on having the index data available.
* Support index-only scans using the visibility map to avoid heap fetches. (Tom Lane, 2011-10-07)
  When a btree index contains all columns required by the query, and the visibility map shows that all tuples on a target heap page are visible-to-all, we don't need to fetch that heap page. This patch depends on the previous patches that made the visibility map reliable.

  There's a fair amount left to do here, notably trying to figure out a less chintzy way of estimating the cost of an index-only scan, but the core functionality seems ready to commit.

  Robert Haas and Ibrar Ahmed, with some previous work by Heikki Linnakangas.
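  A rough sketch of the kind of query that can benefit; the names are illustrative, the index must contain every column the query touches, and a prior VACUUM must have set the visibility-map bits:

      CREATE INDEX accounts_id_balance_idx ON accounts (id, balance);
      VACUUM accounts;
      EXPLAIN SELECT id, balance FROM accounts WHERE id < 1000;
      -- Expect an "Index Only Scan" node once the heap pages are marked all-visible.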
* Replace the "New Linear" GiST split algorithm for boxes and points with aHeikki Linnakangas2011-10-06
| | | | | | | new double-sorting algorithm. The new algorithm produces better quality trees, making searches faster. Alexander Korotkov
* Fix uninitialized-variable bug. (Tom Lane, 2011-10-04)
* Use callbacks in SlruScanDirectory for the actual action (Alvaro Herrera, 2011-10-04)
  Previously, the code assumed that the only possible action to take was to delete files behind a certain cutoff point. The async notify code was already a crock: it used a different "pagePrecedes" function for truncation than for regular operation. By allowing it to pass a callback to SlruScanDirectory it can do cleanly exactly what it needs to do. The clog.c code also had its own use for SlruScanDirectory, which is made a bit simpler with this.
* Restructure error handling in reading of postgresql.conf. (Tom Lane, 2011-10-02)
  This patch has two distinct purposes: to report multiple problems in postgresql.conf rather than always bailing out after the first one, and to change the policy for whether changes are applied when there are unrelated errors in postgresql.conf.

  Formerly the policy was to apply no changes if any errors could be detected, but that had a significant consistency problem, because in some cases specific values might be seen as valid by some processes but invalid by others. This meant that the latter processes would fail to adopt changes in other parameters even though the former processes had done so.

  The new policy is that during SIGHUP, the file is rejected as a whole if there are any errors in the "name = value" syntax, or if any lines attempt to set nonexistent built-in parameters, or if any lines attempt to set custom parameters whose prefix is not listed in (the new value of) custom_variable_classes. These tests should always give the same results in all processes, and provide what seems a reasonably robust defense against loading values from badly corrupted config files. If these tests pass, all processes will apply all settings that they individually see as good, ignoring (but logging) any they don't.

  In addition, the postmaster does not abandon reading a configuration file after the first syntax error, but continues to read the file and report syntax errors (up to a maximum of 100 syntax errors per file). The postmaster will still refuse to start up if the configuration file contains any errors at startup time, but these changes allow multiple errors to be detected and reported before quitting.

  Alexey Klyukin, reviewed by Andy Colson and av (Alexander ?) with some additional hacking by Tom Lane
* Support GiST index support functions that want to cache data across calls. (Tom Lane, 2011-09-30)
  pg_trgm was already doing this unofficially, but the implementation hadn't been thought through very well and leaked memory. Restructure the core GiST code so that it actually works, and document it.

  Ordinarily this would have required an extra memory context creation/destruction for each GiST index search, but I was able to avoid that in the normal case of a non-rescanned search by finessing the handling of the RBTree. It used to have its own context always, but now shares a context with the scan-lifespan data structures, unless there is more than one rescan call. This should make the added overhead unnoticeable in typical cases.
* Update comments related to the crash-safety of the visibility map. (Robert Haas, 2011-09-27)
  In hio.c, document how we avoid deadlock with respect to visibility map buffer locks. In visibilitymap.c, update the LOCKING section of the file header comment. Both oversights noted by Heikki Linnakangas.
* heap_update() must recheck tuple after unlocking and relocking buffer. (Robert Haas, 2011-09-27)
  Bug found by Alvaro Herrera, fix suggested by Heikki Linnakangas and reviewed by Tom Lane.
* Allow snapshot references to still work during transaction abort. (Tom Lane, 2011-09-26)
  In REPEATABLE READ (nee SERIALIZABLE) mode, an attempt to do GetTransactionSnapshot() between AbortTransaction and CleanupTransaction failed, because GetTransactionSnapshot would recompute the transaction snapshot (which is already wrong, given the isolation mode) and then re-register it in the TopTransactionResourceOwner, leading to an Assert because the TopTransactionResourceOwner should be empty of resources after AbortTransaction. This is the root cause of bug #6218 from Yamamoto Takashi.

  While changing plancache.c to avoid requesting a snapshot when handling a ROLLBACK masks the problem, I think this is really a snapmgr.c bug: it's lower-level than the resource manager mechanism and should not be shutting itself down before we unwind resource manager resources. However, just postponing the release of the transaction snapshot until cleanup time didn't work because of the circular dependency with TopTransactionResourceOwner. Fix by managing the internal reference to that snapshot manually instead of depending on TopTransactionResourceOwner. This saves a few cycles as well as making the module layering more straightforward. predicate.c's dependencies on TopTransactionResourceOwner go away too.

  I think this is a longstanding bug, but there's no evidence that it's more than a latent bug, so it doesn't seem worth any risk of back-patching.
* Avoid unnecessary page-level SSI lock check in heap_insert(). (Tom Lane, 2011-09-16)
  As observed by Heikki, we need not conflict on heap page locks during an insert; heap page locks are only aggregated tuple locks, they don't imply locking "gaps" as index page locks do. So we can avoid some unnecessary conflicts, and also do the SSI check while not holding exclusive lock on the target buffer. Kevin Grittner, reviewed by Jeff Davis. Back-patch to 9.1.
* gistendscan() forgot to free so->giststate. (Tom Lane, 2011-09-16)
  This oversight led to a massive memory leak --- upwards of 10KB per tuple --- during creation-time verification of an exclusion constraint based on a GIST index. In most other scenarios it'd just be a leak of 10KB that would be recovered at end of query, so not too significant; though perhaps the leak would be noticeable in a situation where a GIST index was being used in a nestloop inner indexscan. In any case, it's a real leak of long standing, so patch all supported branches. Per report from Harald Fuchs.
* In the final emptying phase of the new GiST buffering build, set the queuedForEmptying flag correctly on buffer when adding it to the queue. (Heikki Linnakangas, 2011-09-12)
  Also, don't add buffer to the queue if it's there already. These were harmless oversights; failing to set the flag just means that a buffer might get added to the queue twice if more tuples are added to it (although that can't actually happen at this point because all the upper buffers have already been emptied), and having the same buffer twice in the emptying queue is harmless. But better be tidy.
* Move Timestamp/Interval typedefs and basic macros into datatype/timestamp.h. (Tom Lane, 2011-09-09)
  As per my recent proposal, this refactors things so that these typedefs and macros are available in a header that can be included in frontend-ish code. I also changed various headers that were undesirably including utils/timestamp.h to include datatype/timestamp.h instead. Unsurprisingly, this showed that half the system was getting utils/timestamp.h by way of xlog.h. No actual code changes here, just header refactoring.
* round() is not portable. Use rint(). (Tom Lane, 2011-09-08)
* Buffering GiST index build algorithm. (Heikki Linnakangas, 2011-09-08)
  When building a GiST index that doesn't fit in cache, buffers are attached to some internal nodes in the index. This speeds up the build by avoiding random I/O that would otherwise be needed to traverse all the way down the tree to find the right leaf page for the tuple. Alexander Korotkov
* Partially revoke attempt to improve performance with many savepoints. (Simon Riggs, 2011-09-07)
  Maintain difference between subtransaction release and commit introduced by earlier patch.
* Adjust translator comment format to xgettext expectations (Alvaro Herrera, 2011-09-05)
* Mark some untranslatable messages with errmsg_internal (Alvaro Herrera, 2011-09-05)
* Clean up the #include mess a little. (Tom Lane, 2011-09-04)
  walsender.h should depend on xlog.h, not vice versa. (Actually, the inclusion was circular until a couple hours ago, which was even sillier; but Bruce broke it in the expedient rather than logically correct direction.) Because of that poor decision, plus blind application of pgrminclude, we had a situation where half the system was depending on xlog.h to include such unrelated stuff as array.h and guc.h. Clean up the header inclusion, and manually revert a lot of what pgrminclude had done so things build again.

  This episode reinforces my feeling that pgrminclude should not be run without adult supervision. Inclusion changes in header files in particular need to be reviewed with great care. More generally, it'd be good if we had a clearer notion of module layering to dictate which headers can sanely include which others ... but that's a big task for another day.
* Whitespace adjustment for consistency in the file (Peter Eisentraut, 2011-09-03)
* Remove unnecessary #include references, per pgrminclude script. (Bruce Momjian, 2011-09-01)