aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/spgist
Commit message (Collapse)AuthorAge
...
* Refactor per-page logic common to all redo routines to a new function.Heikki Linnakangas2014-09-02
| | | | | | | | | | | | Every redo routine uses the same idiom to determine what to do to a page: check if there's a backup block for it, and if not read, the buffer if the block exists, and check its LSN. Refactor that into a common function, XLogReadBufferForRedo, making all the redo routines shorter and more readable. This has no user-visible effect, and makes no changes to the WAL format. Reviewed by Andres Freund, Alvaro Herrera, Michael Paquier.
* Move log_newpage and log_newpage_buffer to xlog.c.Heikki Linnakangas2014-07-31
| | | | | | | | | | | log_newpage is used by many indexams, in addition to heap, but for historical reasons it's always been part of the heapam rmgr. Starting with 9.3, we have another WAL record type for logging an image of a page, XLOG_FPI. Simplify things by moving log_newpage and log_newpage_buffer to xlog.c, and switch to using the XLOG_FPI record type. Bump the WAL version number because the code to replay the old HEAP_NEWPAGE records is removed.
* Fix infinite loop when splitting inner tuples in SPGiST text indexes.Tom Lane2014-06-09
| | | | | | | | | | | | | | | | | | | | | | | Previously, the code used a node label of zero both for strings that contain no bytes beyond the inner tuple's prefix, and for cases where an "allTheSame" inner tuple has to be split to allow a string with a different next byte to be inserted into it. Failing to distinguish these cases meant that if a string ending with the current prefix needed to be inserted into an allTheSame tuple, we got into an infinite loop, because after splitting the tuple we'd descend into the child allTheSame tuple and then find we need to split again. To fix, instead use -1 and -2 as the node labels for these two cases. This requires widening the node label type from "char" to int2, but fortunately SPGiST stores all pass-by-value node label types in their Datum representation, which means that this change is transparently upward compatible so far as the on-disk representation goes. We continue to recognize zero as a dummy node label for reading purposes, but will not attempt to push new index entries down into such a label, so that the loop won't occur even when dealing with an existing index. Per report from Teodor Sigaev. Back-patch to 9.2 where the faulty code was introduced.
* Adjust SP-GiST WAL record formats to reduce alignment padding.Heikki Linnakangas2014-06-05
| | | | | | | | | | | | | | | | | | The way the code was written, the padding was copied from uninitialized memory areas.. Because the structs are local variables in the code where the WAL records are constructed, making them larger and zeroing the padding bytes would not make the code very pretty, so rather than fixing this directly by zeroing out the padding bytes, it seems more clear to not try to align the tuples in the WAL records. The redo functions are taught to copy the tuple header to a local variable to avoid unaligned access. Stable-branches have the same problem, but we can't change the WAL format there, so fix in master only. Reading a few random extra bytes at the stack is harmless in practice, so it's not worth crafting a different back-patchable fix. Per reports from Kevin Grittner and Andres Freund, using clang static analyzer and Valgrind, respectively.
* pgindent run for 9.4Bruce Momjian2014-05-06
| | | | | This includes removing tabs after periods in C comments, which was applied to back branches, so this change should not effect backpatching.
* Avoid allocations in critical sections.Heikki Linnakangas2014-04-04
| | | | If a palloc in a critical section fails, it becomes a PANIC.
* Allow use of "z" flag in our printf calls, and use it where appropriate.Tom Lane2014-01-23
| | | | | | | | | | | | | | | | | | | | | | | | | Since C99, it's been standard for printf and friends to accept a "z" size modifier, meaning "whatever size size_t has". Up to now we've generally dealt with printing size_t values by explicitly casting them to unsigned long and using the "l" modifier; but this is really the wrong thing on platforms where pointers are wider than longs (such as Win64). So let's start using "z" instead. To ensure we can do that on all platforms, teach src/port/snprintf.c to understand "z", and add a configure test to force use of that implementation when the platform's version doesn't handle "z". Having done that, modify a bunch of places that were using the unsigned-long hack to use "z" instead. This patch doesn't pretend to have gotten everyplace that could benefit, but it catches many of them. I made an effort in particular to ensure that all uses of the same error message text were updated together, so as not to increase the number of translatable strings. It's possible that this change will result in format-string warnings from pre-C99 compilers. We might have to reconsider if there are any popular compilers that will warn about this; but let's start by seeing what the buildfarm thinks. Andres Freund, with a little additional work by me
* Update copyright for 2014Bruce Momjian2014-01-07
| | | | | Update all files in head, and files COPYRIGHT and legal.sgml in all back branches.
* Don't include unused space in LOG_NEWPAGE records.Heikki Linnakangas2013-12-04
| | | | | This is the same trick we use when taking a full page image of a buffer passed to XLogInsert.
* Retry after buffer locking failure during SPGiST index creation.Tom Lane2013-11-02
| | | | | | | | | The original coding thought this case was impossible, but it can happen if the bgwriter or checkpointer processes decide to write out an index page while creation is still proceeding, leading to a bogus "unexpected spgdoinsert() failure" error. Problem reported by Jonathan S. Katz. Teodor Sigaev
* Typo fix.Robert Haas2013-09-18
| | | | Etsuro Fujita
* Avoid deadlocks during insertion into SP-GiST indexes.Tom Lane2013-06-14
| | | | | | | | | | | | | SP-GiST's original scheme for avoiding deadlocks during concurrent index insertions doesn't work, as per report from Hailong Li, and there isn't any evident way to make it work completely. We could possibly lock individual inner tuples instead of their whole pages, but preliminary experimentation suggests that the performance penalty would be huge. Instead, if we fail to get a buffer lock while descending the tree, just restart the tree descent altogether. We keep the old tuple positioning rules, though, in hopes of reducing the number of cases where this can happen. Teodor Sigaev, somewhat edited by Tom Lane
* pgindent run for release 9.3Bruce Momjian2013-05-29
| | | | | This is the first run of the Perl-based pgindent script. Also update pgindent instructions.
* Use the term "radix tree" instead of "suffix tree" for SP-GiST text opclass.Heikki Linnakangas2013-05-08
| | | | | | | What we have implemented is a radix tree (or a radix trie or a patricia trie), but the docs and code comments incorrectly called it a "suffix tree". Alexander Korotkov
* Allow I/O reliability checks using 16-bit checksumsSimon Riggs2013-03-22
| | | | | | | | | | | | | | | | | | | Checksums are set immediately prior to flush out of shared buffers and checked when pages are read in again. Hint bit setting will require full page write when block is dirtied, which causes various infrastructure changes. Extensive comments, docs and README. WARNING message thrown if checksum fails on non-all zeroes page; ERROR thrown but can be disabled with ignore_checksum_failure = on. Feature enabled by an initdb option, since transition from option off to option on is long and complex and has not yet been implemented. Default is not to use checksums. Checksum used is WAL CRC-32 truncated to 16-bits. Simon Riggs, Jeff Davis, Greg Smith Wide input and assistance from many community members. Thank you.
* Remove PageSetTLI and rename pd_tli to pd_checksumSimon Riggs2013-03-18
| | | | | | | | | | | | | | Remove use of PageSetTLI() from all page manipulation functions and adjust README to indicate change in the way we make changes to pages. Repurpose those bytes into the pd_checksum field and explain how that works in comments about page header. Refactoring ahead of actual feature patch which would make use of the checksum field, arriving later. Jeff Davis, with comments and doc changes by Simon Riggs Direction suggested by Robert Haas; many others providing review comments.
* Update copyrights for 2013Bruce Momjian2013-01-01
| | | | | Fully update git head, and update back branches in ./COPYRIGHT and legal.sgml files.
* Remove obsolete XLogRecPtr macrosAlvaro Herrera2012-12-28
| | | | | | | | | | | | | | | | | This gets rid of XLByteLT, XLByteLE, XLByteEQ and XLByteAdvance. These were useful for brevity when XLogRecPtrs were split in xlogid/xrecoff; but now that they are simple uint64's, they are just clutter. The only downside to making this change would be ease of backporting patches, but that has been negated by other substantive changes to the involved code anyway. The clarity of simpler expressions makes the change worthwhile. Most of the changes are mechanical, but in a couple of places, the patch author chose to invert the operator sense, making the code flow more logical (and more in line with preceding comments). Author: Andres Freund Eyeballed by Dimitri Fontaine and Alvaro Herrera
* Split out rmgr rm_desc functions into their own filesAlvaro Herrera2012-11-28
| | | | | This is necessary (but not sufficient) to have them compilable outside of a backend environment.
* Fix multiple problems in WAL replay.Tom Lane2012-11-12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.
* Trim spgist_private.h inclusionAlvaro Herrera2012-09-05
| | | | It doesn't really need rel.h; relcache.h is enough.
* Optimize SP-GiST insertions.Heikki Linnakangas2012-08-29
| | | | | | | This includes two micro-optimizations to the tight inner loop in descending the SP-GiST tree: 1. avoid an extra function call to index_getprocinfo when calling user-defined choose function, and 2. avoid a useless palloc+pfree when node labels are not used.
* Split heapam_xlog.h from heapam.hAlvaro Herrera2012-08-28
| | | | | | | | | | | | The heapam XLog functions are used by other modules, not all of which are interested in the rest of the heapam API. With this, we let them get just the XLog stuff in which they are interested and not pollute them with unrelated includes. Also, since heapam.h no longer requires xlog.h, many files that do include heapam.h no longer get xlog.h automatically, including a few headers. This is useful because heapam.h is getting pulled in by execnodes.h, which is in turn included by a lot of files.
* In SPGiST replay, do conflict resolution before modifying the page.Tom Lane2012-08-03
| | | | | | | | | In yesterday's commit 962e0cc71e839c58fb9125fa85511b8bbb8bdbee, I added the ResolveRecoveryConflictWithSnapshot call in the wrong place. I correctly put it before spgRedoVacuumRedirect itself would modify the index page --- but not before RestoreBkpBlocks, so replay of a record with a full-page image would modify the page before kicking off any conflicting HS transactions. Oops.
* Fix race conditions associated with SPGiST redirection tuples.Tom Lane2012-08-02
| | | | | | | | | | | | | | | The correct test for whether a redirection tuple is removable is whether tuple's xid < RecentGlobalXmin, not OldestXmin; the previous coding failed to protect index searches being done in concurrent transactions that have no XID. This mirrors the recent fix in btree's page recycling logic made in commit d3abbbebe52eb1e59e621c880ad57df9d40d13f2. Also, WAL-log the newest XID of any removed redirection tuple on an index page, and apply ResolveRecoveryConflictWithSnapshot during InHotStandby WAL replay. This protects against concurrent Hot Standby transactions possibly needing to see the redirection tuple(s). Per my query of 2012-03-12 and subsequent discussion.
* Assorted message style improvementsPeter Eisentraut2012-07-02
|
* Cope with smaller-than-normal BLCKSZ setting in SPGiST indexes on text.Tom Lane2012-06-26
| | | | | | | | | | | | The original coding failed miserably for BLCKSZ of 4K or less, as reported by Josh Kupershmidt. With the present design for text indexes, a given inner tuple could have up to 256 labels (requiring either 3K or 4K bytes depending on MAXALIGN), which means that we can't positively guarantee no failures for smaller blocksizes. But we can at least make it behave sanely so long as there are few enough labels to fit on a page. Considering that btree is also more prone to "index tuple too large" failures when BLCKSZ is small, it's not clear that we should expend more work than this on this case.
* Replace int2/int4 in C code with int16/int32Peter Eisentraut2012-06-25
| | | | | | | | | | The latter was already the dominant use, and it's preferable because in C the convention is that intXX means XX bits. Therefore, allowing mixed use of int2, int4, int8, int16, int32 is obviously confusing. Remove the typedefs for int2 and int4 for now. They don't seem to be widely used outside of the PostgreSQL source tree, and the few uses can probably be cleaned up by the time this ships.
* Run pgindent on 9.2 source tree in preparation for first 9.3Bruce Momjian2012-06-10
| | | | commit-fest.
* Remove duplicate words in comments.Heikki Linnakangas2012-05-02
| | | | Found these with grep -r "for for ".
* Fix SPGiST vacuum algorithm to handle concurrent tuple motion properly.Tom Lane2012-03-12
| | | | | | | | | | | | | A leaf tuple that we need to delete could get moved as a consequence of an insertion happening concurrently with the VACUUM scan. If it moves from a page past the current scan point to a page before, we'll miss it, which is not acceptable. Hence, when we see a leaf-page REDIRECT that could have been made since our scan started, chase down the redirection pointer much as if we were doing a normal index search, and be sure to vacuum every page it leads to. This fixes the issue because, if the tuple was on page N at the instant we start our scan, we will surely find it as a consequence of chasing the redirect from page N, no matter how much it moves around in between. Problem noted by Takashi Yamamoto.
* Teach SPGiST to store nulls and do whole-index scans.Tom Lane2012-03-11
| | | | | | | | | | | | | | | | | | | | | This patch fixes the other major compatibility-breaking limitation of SPGiST, that it didn't store anything for null values of the indexed column, and so could not support whole-index scans or "x IS NULL" tests. The approach is to create a wholly separate search tree for the null entries, and use fixed "allTheSame" insertion and search rules when processing this tree, instead of calling the index opclass methods. This way the opclass methods do not need to worry about dealing with nulls. Catversion bump is for pg_am updates as well as the change in on-disk format of SPGiST indexes; there are some tweaks in SPGiST WAL records as well. Heavily rewritten version of a patch by Oleg Bartunov and Teodor Sigaev. (The original also stored nulls separately, but it reused GIN code to do so; which required undesirable compromises in the on-disk format, and would likely lead to bugs due to the GIN code being required to work in two very different contexts.)
* Restructure SPGiST opclass interface API to support whole-index scans.Tom Lane2012-03-10
| | | | | | | | | | | | | | | | | | | | The original API definition was incapable of supporting whole-index scans because there was no way to invoke leaf-value reconstruction without checking any qual conditions. Also, it was inefficient for multiple-qual-condition scans because value reconstruction got done over again for each qual condition, and because other internal work in the consistent functions likewise had to be done for each qual. To fix these issues, pass the whole scankey array to the opclass consistent functions, instead of only letting them see one item at a time. (Essentially, the loop over scankey entries is now inside the consistent functions not outside them. This makes the consistent functions a bit more complicated, but not unreasonably so.) In itself this commit does nothing except save a few cycles in multiple-qual-condition index scans, since we can't support whole-index scans on SPGiST indexes until nulls are included in the index. However, I consider this a must-fix for 9.2 because once we release it will get very much harder to change the opclass API definition.
* Update copyright notices for year 2012.Bruce Momjian2012-01-01
|
* Rename updateNodeLink to spgUpdateNodeLink.Tom Lane2011-12-19
| | | | | | | On reflection, the original name seems way too generic for a global symbol. A quick check shows this is the only exported function name in SP-GiST that doesn't begin with "spg" or contain "SpGist", so the rest of them seem all right.
* Teach SP-GiST to do index-only scans.Tom Lane2011-12-19
| | | | | | | | | | | | Operator classes can specify whether or not they support this; this preserves the flexibility to use lossy representations within an index. In passing, move constant data about a given index into the rd_amcache cache area, instead of doing fresh lookups each time we start an index operation. This is mainly to try to make sure that spgcanreturn() has insignificant cost; I still don't have any proof that it matters for actual index accesses. Also, get rid of useless copying of FmgrInfo pointers; we can perfectly well use the relcache's versions in-place.
* Replace simple constant pg_am.amcanreturn with an AM support function.Tom Lane2011-12-18
| | | | | | | | | The need for this was debated when we put in the index-only-scan feature, but at the time we had no near-term expectation of having AMs that could support such scans for only some indexes; so we kept it simple. However, the SP-GiST AM forces the issue, so let's fix it. This patch only installs the new API; no behavior actually changes.
* Defend against null scankeys in spgist searches.Tom Lane2011-12-17
| | | | Should've thought of that one earlier.
* Fix compiler warning seen on 64-bit machine.Tom Lane2011-12-17
|
* Add SP-GiST (space-partitioned GiST) index access method.Tom Lane2011-12-17
SP-GiST is comparable to GiST in flexibility, but supports non-balanced partitioned search structures rather than balanced trees. As described at PGCon 2011, this new indexing structure can beat GiST in both index build time and query speed for search problems that it is well matched to. There are a number of areas that could still use improvement, but at this point the code seems committable. Teodor Sigaev and Oleg Bartunov, with considerable revisions by Tom Lane