aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/gist
Commit message (Collapse)AuthorAge
* Fix multiple problems in WAL replay.Tom Lane2012-11-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.
* Back-patch recent fixes for gistchoose and gistRelocateBuildBuffersOnSplit.Tom Lane2012-08-30
| | | | | | | | | | | | | | | | This back-ports commits c8ba697a4bdb934f0c51424c654e8db6133ea255 and e5db11c5582b469c04a11f217a0f32c827da5dd7, which fix one definite and one speculative bug in gistchoose, and make the code a lot more intelligible as well. In 9.2 only, this also affects the largely-copied-and-pasted logic in gistRelocateBuildBuffersOnSplit. The impact of the bugs was that the functions might make poor decisions as to which index tree branch to push a new entry down into, resulting in GiST index bloat and poor performance. The fixes rectify these decisions for future insertions, but a REINDEX would be needed to clean up any existing index bloat. Alexander Korotkov, Robert Haas, Tom Lane
* Fix GiST buffering build bug, which caused "failed to re-find parent" errors.Heikki Linnakangas2012-08-16
| | | | | | | | | | | | | | | We use a hash table to track the parents of inner pages, but when inserting to a leaf page, the caller of gistbufferinginserttuples() must pass a correct block number of the leaf's parent page. Before gistProcessItup() descends to a child page, it checks if the downlink needs to be adjusted to accommodate the new tuple, and updates the downlink if necessary. However, updating the downlink might require splitting the page, which might move the downlink to a page to the right. gistProcessItup() doesn't realize that, so when it descends to the leaf page, it might pass an out-of-date parent block number as a result. Fix that by returning the block a tuple was inserted to from gistbufferinginserttuples(). This fixes the bug reported by Zdeněk Jílovec.
* Run pgindent on 9.2 source tree in preparation for first 9.3Bruce Momjian2012-06-10
| | | | commit-fest.
* Change the way parent pages are tracked during buffered GiST build.Heikki Linnakangas2012-05-30
| | | | | | | | | | | | | | | | | | We used to mimic the way a stack is constructed when descending the tree during normal GiST inserts, but that was quite complicated during a buffered build. It was also wrong: in GiST, the left-to-right relationships on different levels might not match each other, so that when you know the parent of a child page, you won't necessarily find the parent of the page to the right of the child page by following the rightlinks at the parent level. This sometimes led to "could not re-find parent" errors while building a GiST index. We now use a simple hash table to track the parent of every internal page. Whenever a page is split, and downlinks are moved from one page to another, we update the hash table accordingly. This is also better for performance than the old method, as we never need to move right to re-find the parent page, which could take a significant amount of time for buffers that were created much earlier in the index build.
* Delete the temporary file used in buffered GiST build, after the build.Heikki Linnakangas2012-05-30
| | | | | | There were two bugs here: We forgot to call gistFreeBuildBuffers() function at the end of build, and we passed interXact == true to BufFileCreateTemp, so the file wasn't automatically cleaned up at end-of-transaction either.
* Fix integer overflow bug in GiST buffering build calculations.Heikki Linnakangas2012-05-29
| | | | | | | The result of (maintenance_work_mem * 1024) / BLCKSZ doesn't fit in a signed 32-bit integer, if maintenance_work_mem >= 2GB. Use double instead. And while we're at it, write the calculations in an easier to understand form, with the intermediary steps written out and commented.
* Fix bug in gistRelocateBuildBuffersOnSplit().Heikki Linnakangas2012-05-18
| | | | | | | | | | | | | | | | | | | When we create a temporary copy of the old node buffer, in stack, we mustn't leak that into any of the long-lived data structures. Before this patch, when we called gistPopItupFromNodeBuffer(), it got added to the array of "loaded buffers". After gistRelocateBuildBuffersOnSplit() exits, the pointer added to the loaded buffers array points to garbage. Often that goes unnotied, because when we go through the array of loaded buffers to unload them, buffers with a NULL pageBuffer are ignored, which can often happen by accident even if the pointer points to garbage. This patch fixes that by marking the temporary copy in stack explicitly as temporary, and refrain from adding buffers marked as temporary to the array of loaded buffers. While we're at it, initialize nodeBuffer->pageBlocknum to InvalidBlockNumber and improve comments a bit. This isn't strictly necessary, but makes debugging easier.
* Fix obsolescent C declaration syntaxPeter Eisentraut2012-05-12
| | | | | gcc -Wextra/-Wold-style-declaration thinks that "inline" should go before the function return type.
* On GiST page split, release the locks on child pages before recursing up.Heikki Linnakangas2012-05-11
| | | | | | | | | | | | | | | | | | When inserting the downlinks for a split gist page, we used hold the locks on the child pages until the insertion into the parent - and recursively its parent if it had to be split too - were all completed. Change that so that the locks on child pages are released after the insertion in the immediate parent is done, before recursing further up the tree. This reduces the number of lwlocks that are held simultaneously. Holding many locks is bad for concurrency, and in extreme cases you can even hit the limit of 100 simultaneously held lwlocks in a backend. If you're really unlucky, you can hit the limit while in a critical section, which brings down the whole system. This fixes bug #6629 reported by Tom Forbes. Backpatch to 9.1. The page splitting code was rewritten in 9.1, and the old code did not have this problem.
* More duplicate word removal.Robert Haas2012-05-02
|
* When a GiST page is split during index build, it might not have a buffer.Heikki Linnakangas2012-03-02
| | | | | | | | | | | | | Previously it was thought that it's impossible as the code stands, because insertions create buffers as tuples are cascaded downwards, and index split also creaters buffers eagerly for all halves. But the example from Jay Levitt demonstrates that it can happen, when the root page is split. It's in fact OK if the buffer doesn't exist, so we just need to remove the sanity check. In fact, we've been discussing the possibility of destroying empty buffers to conserve memory, which would render the sanity check completely useless anyway. Fix by Alexander Korotkov
* Add const qualifiers where they are accidentally cast awayPeter Eisentraut2012-02-28
| | | | | This only produces warnings under -Wcast-qual, but it's more correct and consistent in any case.
* Add some enumeration commas, for consistencyPeter Eisentraut2012-02-24
|
* Throw error sooner for unlogged GiST indexes.Tom Lane2012-02-08
| | | | | | Throwing an error only after we've built the main index fork is pretty unfriendly when the table already contains data. Per gripe from Jay Levitt.
* Assorted comment fixes, mostly just typos, but some obsolete statements.Tom Lane2012-01-29
| | | | YAMAMOTO Takashi
* Update copyright notices for year 2012.Bruce Momjian2012-01-01
|
* Use IEEE infinity, not 1e10, for null-and-not-null case in gistpenalty().Tom Lane2011-11-27
| | | | | | | Use of a randomly chosen large value was never exactly graceful, and now that there are penalty functions that are intentionally using infinity, it doesn't seem like a good idea for null-vs-not-null to be using something less.
* Clean up a couple of box gist helper functions.Heikki Linnakangas2011-10-09
| | | | | | | | | | The original idea of this patch was to make box picksplit run faster, by eliminating unnecessary palloc() overhead, but that was obsoleted by the new double-sorting split algorithm that doesn't call these functions so heavily anymore. Nevertheless, the code looks better this way. Original patch by me, reviewed and tidied up after the double-sorting patch by Kevin Grittner.
* Replace the "New Linear" GiST split algorithm for boxes and points with aHeikki Linnakangas2011-10-06
| | | | | | | new double-sorting algorithm. The new algorithm produces better quality trees, making searches faster. Alexander Korotkov
* Support GiST index support functions that want to cache data across calls.Tom Lane2011-09-30
| | | | | | | | | | | | pg_trgm was already doing this unofficially, but the implementation hadn't been thought through very well and leaked memory. Restructure the core GiST code so that it actually works, and document it. Ordinarily this would have required an extra memory context creation/destruction for each GiST index search, but I was able to avoid that in the normal case of a non-rescanned search by finessing the handling of the RBTree. It used to have its own context always, but now shares a context with the scan-lifespan data structures, unless there is more than one rescan call. This should make the added overhead unnoticeable in typical cases.
* gistendscan() forgot to free so->giststate.Tom Lane2011-09-16
| | | | | | | | | | This oversight led to a massive memory leak --- upwards of 10KB per tuple --- during creation-time verification of an exclusion constraint based on a GIST index. In most other scenarios it'd just be a leak of 10KB that would be recovered at end of query, so not too significant; though perhaps the leak would be noticeable in a situation where a GIST index was being used in a nestloop inner indexscan. In any case, it's a real leak of long standing, so patch all supported branches. Per report from Harald Fuchs.
* In the final emptying phase of the new GiST buffering build, set theHeikki Linnakangas2011-09-12
| | | | | | | | | | queuedForEmptying flag correctly on buffer when adding it to the queue. Also, don't add buffer to the queue if it's there already. These were harmless oversights; failing to set the flag just means that a buffer might get added to the queue twice if more tuples are added to it (although that can't actually happen at this point because all the upper buffers have already been emptied), and having the same buffer twice in the emptying queue is harmless. But better be tidy.
* round() is not portable. Use rint().Tom Lane2011-09-08
|
* Buffering GiST index build algorithm.Heikki Linnakangas2011-09-08
| | | | | | | | | When building a GiST index that doesn't fit in cache, buffers are attached to some internal nodes in the index. This speeds up the build by avoiding random I/O that would otherwise be needed to traverse all the way down the tree to the find right leaf page for tuple. Alexander Korotkov
* Remove unnecessary #include references, per pgrminclude script.Bruce Momjian2011-09-01
|
* Change the way the offset of downlink is stored in GISTInsertStack.Heikki Linnakangas2011-07-15
| | | | | | | | | | | | | | | | | | GISTInsertStack.childoffnum used to mean "offset of the downlink in this node, pointing to the child node in the stack". It's now replaced with downlinkoffnum, which means "offset of the downlink in the parent of this node". gistFindPath() already used childoffnum with this new meaning, and had an extra step at the end to pull all the childoffnum values down one node in the stack, to adjust the stack for the meaning that childoffnum had elsewhere. That's no longer required. The reason to do this now is this new representation is more convenient for the GiST fast build patch that Alexander Korotkov is working on. While we're at it, replace the linked list used in gistFindPath with a standard List, and make gistFindPath() static. Alexander Korotkov, with some changes by me.
* Fix two ancient bugs in GiST code to re-find a parent after page split:Heikki Linnakangas2011-07-15
| | | | | | | | | | | | | | | | | | | | | | | | First, when following a right-link, we incorrectly marked the current page as the parent of the right sibling. In reality, the parent of the right page is the same as the parent of the current page (or some page to the right of it, gistFindCorrectParent() will sort that out). Secondly, when we follow a right-link, we must prepend, not append, the right page to our list of pages to visit. That's because we assume that once we hit a leaf page in the list, all the rest are leaf pages too, and give up. To hit these bugs, you need concurrent actions and several unlucky accidents. Another backend must split the root page, while you're in process of splitting a lower-level page. Furthermore, while you scan the internal nodes to re-find the parent, another backend needs to again split some more internal pages. Even then, the bugs don't necessarily manifest as user-visible errors or index corruption. While we're at it, make the error reporting a bit better if gistFindPath() fails to re-find the parent. It used to be an assertion, but an elog() seems more appropriate. Backpatch to all supported branches.
* Move Trigger and TriggerDesc structs out of rel.h into a new reltrigger.hAlvaro Herrera2011-07-04
| | | | | This lets us stop including rel.h into execnodes.h, which is a widely used header.
* Message style and spelling improvementsPeter Eisentraut2011-06-22
|
* Pgindent run before 9.1 beta2.Bruce Momjian2011-06-09
|
* Protect GIST logic that assumes penalty values can't be negative.Tom Lane2011-05-31
| | | | | | | | | | Apparently sane-looking penalty code might return small negative values, for example because of roundoff error. This will confuse places like gistchoose(). Prevent problems by clamping negative penalty values to zero. (Just to be really sure, I also made it force NaNs to zero.) Back-patch to all supported branches. Alexander Korotkov
* Spell checking and markup refinementPeter Eisentraut2011-05-19
|
* Make GIN and GIST pass the index collation to all their support functions.Tom Lane2011-04-22
| | | | | | | Experimentation with contrib/btree_gist shows that the majority of the GIST support functions potentially need collation information. Safest policy seems to be to pass it to all of them, instead of making assumptions about which ones could possibly need it.
* Pass collations to functions in FunctionCallInfoData, not FmgrInfo.Tom Lane2011-04-12
| | | | | | | | | | | Since collation is effectively an argument, not a property of the function, FmgrInfo is really the wrong place for it; and this becomes critical in cases where a cached FmgrInfo is used for varying purposes that might need different collation settings. Fix by passing it in FunctionCallInfoData instead. In particular this allows a clean fix for bug #5970 (record_cmp not working). This requires touching a bit more code than the original method, but nobody ever thought that collations would not be an invasive patch...
* Clean up most -Wunused-but-set-variable warnings from gcc 4.6Peter Eisentraut2011-04-11
| | | | | | This warning is new in gcc 4.6 and part of -Wall. This patch cleans up most of the noise, but there are some still warnings that are trickier to remove.
* pgindent run before PG 9.1 beta 1.Bruce Momjian2011-04-10
|
* Clean up cruft around collation initialization for tupdescs and scankeys.Tom Lane2011-03-26
| | | | | I found actual bugs in GiST and plpgsql; the rest of this is cosmetic but meant to decrease the odds of future bugs of omission.
* Fix crash in the new GiST insertion code, when an update splits the root page.Heikki Linnakangas2011-01-09
| | | | This bug was exercised by contrib/intarray/bench, as noted by Tom Lane.
* Stamp copyrights for year 2011.Bruce Momjian2011-01-01
|
* Support unlogged tables.Robert Haas2010-12-29
| | | | | | | The contents of an unlogged table are WAL-logged; thus, they are not available on standby servers and are truncated whenever the database system enters recovery. Indexes on unlogged tables are also unlogged. Unlogged GiST indexes are not currently supported.
* Rewrite the GiST insertion logic so that we don't need the post-recoveryHeikki Linnakangas2010-12-23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | cleanup stage to finish incomplete inserts or splits anymore. There was two reasons for the cleanup step: 1. When a new tuple was inserted to a leaf page, the downlink in the parent needed to be updated to contain (ie. to be consistent with) the new key. Updating the parent in turn might require recursively updating the parent of the parent. We now handle that by updating the parent while traversing down the tree, so that when we insert the leaf tuple, all the parents are already consistent with the new key, and the tree is consistent at every step. 2. When a page is split, we need to insert the downlink for the new right page(s), and update the downlink for the original page to not include keys that moved to the right page(s). We now handle that by setting a new flag, F_FOLLOW_RIGHT, on the non-rightmost pages in the split. When that flag is set, scans always follow the rightlink, regardless of the NSN mechanism used to detect concurrent page splits. That way the tree is consistent right after split, even though the downlink is still missing. This is very similar to the way B-tree splits are handled. When the downlink is inserted in the parent, the flag is cleared. To keep the insertion algorithm simple, when an insertion sees an incomplete split, indicated by the F_FOLLOW_RIGHT flag, it finishes the split before doing anything else. These changes allow removing the whole "invalid tuple" mechanism, but I retained the scan code to still follow invalid tuples correctly. While we don't create any such tuples anymore, we want to handle them gracefully in case you pg_upgrade a GiST index that has them. If we encounter any on an insert, though, we just throw an error saying that you need to REINDEX. The issue that got me into doing this is that if you did a checkpoint while an insert or split was in progress, and the checkpoint finishes quickly so that there is no WAL record related to the insert between RedoRecPtr and the checkpoint record, recovery from that checkpoint would not know to finish the incomplete insert. IOW, we have the same issue we solved with the rm_safe_restartpoint mechanism during normal operation too. It's highly unlikely to happen in practice, and this fix is far too large to backpatch, so we're just going to live with in previous versions, but this refactoring fixes it going forward. With this patch, you don't get the annoying 'index "FOO" needs VACUUM or REINDEX to finish crash recovery' notices anymore if you crash at an unfortunate moment.
* Generalize concept of temporary relations to "relation persistence".Robert Haas2010-12-13
| | | | | | | | | | | | | | | This commit replaces pg_class.relistemp with pg_class.relpersistence; and also modifies the RangeVar node type to carry relpersistence rather than istemp. It also removes removes rd_istemp from RelationData and instead performs the correct computation based on relpersistence. For clarity, we add three new macros: RelationNeedsWAL(), RelationUsesLocalBuffers(), and RelationUsesTempNamespace(), so that we can clarify the purpose of each check that previous depended on rd_istemp. This is intended as infrastructure for the upcoming unlogged tables patch, as well as for future possible work on global temporary tables.
* Fix two small bugs in new gistget.c logic.Tom Lane2010-12-04
| | | | | | | | | | | | | 1. Complain, rather than silently doing nothing, if an "invalid" tuple is found on a leaf page. Per off-list discussion with Heikki. 2. Fix oversight in code that removes a GISTSearchItem from the search queue: we have to reset lastHeap if this was the last heap item in the parent GISTSearchTreeItem. Otherwise subsequent additions will do the wrong thing. This was probably masked in early testing because in typical cases the parent item would now be completely empty and would be deleted on next call. You'd need a queued non-leaf page at exactly the same distance as a heap tuple to expose the bug.
* Add external documentation for KNNGIST.Tom Lane2010-12-03
|
* Put back gistgettuple's check for backwards scan request.Tom Lane2010-12-03
| | | | | On reflection it's a bad idea for the KNNGIST patch to have removed that. We don't want it silently returning incorrect answers.
* KNNGIST, otherwise known as order-by-operator support for GIST.Tom Lane2010-12-03
| | | | | | | | | | | | | | | This commit represents a rather heavily editorialized version of Teodor's builtin_knngist_itself-0.8.2 and builtin_knngist_proc-0.8.1 patches. I redid the opclass API to add a separate Distance method instead of turning the Consistent method into an illogical mess, fixed some bit-rot in the rbtree interfaces, and generally worked over the code style and comments. There's still no non-code documentation to speak of, but I'll work on that separately. Some contrib-module changes are also yet to come (right now, point <-> point is the only KNN-ified operator). Teodor Sigaev and Tom Lane
* Create core infrastructure for KNNGIST.Tom Lane2010-12-02
| | | | | | | | | | | | | | | | | | | This is a heavily revised version of builtin_knngist_core-0.9. The ordering operators are no longer mixed in with actual quals, which would have confused not only humans but significant parts of the planner. Instead, ordering operators are carried separately throughout planning and execution. Since the API for ambeginscan and amrescan functions had to be changed anyway, this commit takes the opportunity to rationalize that a bit. RelationGetIndexScan no longer forces a premature index_rescan call; instead, callers of index_beginscan must call index_rescan too. Aside from making the AM-side initialization logic a bit less peculiar, this has the advantage that we do not make a useless extra am_rescan call when there are runtime key values. AMs formerly could not assume that the key values passed to amrescan were actually valid; now they can. Teodor Sigaev and Tom Lane
* Remove useless whitespace at end of linesPeter Eisentraut2010-11-23
|
* The GiST scan algorithm uses LSNs to detect concurrent pages splits, butHeikki Linnakangas2010-11-16
| | | | | | | | | | | | | temporary indexes are not WAL-logged. We used a constant LSN for temporary indexes, on the assumption that we don't need to worry about concurrent page splits in temporary indexes because they're only visible to the current session. But that assumption is wrong, it's possible to insert rows and split pages in the same session, while a scan is in progress. For example, by opening a cursor and fetching some rows, and INSERTing new rows before fetching some more. Fix by generating fake increasing LSNs, used in place of real LSNs in temporary GiST indexes.