aboutsummaryrefslogtreecommitdiff
path: root/src/backend/access/gist/README
diff options
context:
space:
mode:
Diffstat (limited to 'src/backend/access/gist/README')
-rw-r--r--src/backend/access/gist/README181
1 files changed, 111 insertions, 70 deletions
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 03069f02685..2d78dcb0dfa 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -108,43 +108,71 @@ Penalty is used for choosing a subtree to insert; method PickSplit is used for
the node splitting algorithm; method Union is used for propagating changes
upward to maintain the tree properties.
-NOTICE: We modified original INSERT algorithm for performance reason. In
-particularly, it is now a single-pass algorithm.
-
-Function findLeaf is used to identify subtree for insertion. Page, in which
-insertion is proceeded, is locked as well as its parent page. Functions
-findParent and findPath are used to find parent pages, which could be changed
-because of concurrent access. Function pageSplit is recurrent and could split
-page by more than 2 pages, which could be necessary if keys have different
-lengths or more than one key are inserted (in such situation, user defined
-function pickSplit cannot guarantee free space on page).
-
-findLeaf(new-key)
- push(stack, [root, 0]) //page, LSN
- while(true)
- ptr = top of stack
- latch( ptr->page, S-mode )
- ptr->lsn = ptr->page->lsn
- if ( exists ptr->parent AND ptr->parent->lsn < ptr->page->nsn )
- unlatch( ptr->page )
- pop stack
- else if ( ptr->page is not leaf )
- push( stack, [get_best_child(ptr->page, new-key), 0] )
- unlatch( ptr->page )
- else
- unlatch( ptr->page )
- latch( ptr->page, X-mode )
- if ( ptr->page is not leaf )
- //the only root page can become a non-leaf
- unlatch( ptr->page )
- else if ( ptr->parent->lsn < ptr->page->nsn )
- unlatch( ptr->page )
- pop stack
- else
- return stack
- end
- end
- end
+To insert a tuple, we first have to find a suitable leaf page to insert to.
+The algorithm walks down the tree, starting from the root, along the path
+of smallest Penalty. At each step:
+
+1. Has this page been split since we looked at the parent? If so, it's
+possible that we should be inserting to the other half instead, so retreat
+back to the parent.
+2. If this is a leaf node, we've found our target node.
+3. Otherwise use Penalty to pick a new target subtree.
+4. Check the key representing the target subtree. If it doesn't already cover
+the key we're inserting, replace it with the Union of the old downlink key
+and the key being inserted. (Actually, we always call Union, and just skip
+the replacement if the Unioned key is the same as the existing key)
+5. Replacing the key in step 4 might cause the page to be split. In that case,
+propagate the change upwards and restart the algorithm from the first parent
+that didn't need to be split.
+6. Walk down to the target subtree, and goto 1.
+
+This differs from the insertion algorithm in the original paper. In the
+original paper, you first walk down the tree until you reach a leaf page, and
+then you adjust the downlink in the parent, and propagating the adjustment up,
+all the way up to the root in the worst case. But we adjust the downlinks to
+cover the new key already when we walk down, so that when we reach the leaf
+page, we don't need to update the parents anymore, except to insert the
+downlinks if we have to split the page. This makes crash recovery simpler:
+after inserting a key to the page, the tree is immediately self-consistent
+without having to update the parents. Even if we split a page and crash before
+inserting the downlink to the parent, the tree is self-consistent because the
+right half of the split is accessible via the rightlink of the left page
+(which replaced the original page).
+
+Note that the algorithm can walk up and down the tree before reaching a leaf
+page, if internal pages need to split while adjusting the downlinks for the
+new key. Eventually, you should reach the bottom, and proceed with the
+insertion of the new tuple.
+
+Once we've found the target page to insert to, we check if there's room
+for the new tuple. If there is, the tuple is inserted, and we're done.
+If it doesn't fit, however, the page needs to be split. Note that it is
+possible that a page needs to be split into more than two pages, if keys have
+different lengths or more than one key is being inserted at a time (which can
+happen when inserting downlinks for a page split that resulted in more than
+two pages at the lower level). After splitting a page, the parent page needs
+to be updated. The downlink for the new page needs to be inserted, and the
+downlink for the old page, which became the left half of the split, needs to
+be updated to only cover those tuples that stayed on the left page. Inserting
+the downlink in the parent can again lead to a page split, recursing up to the
+root page in the worst case.
+
+gistplacetopage is the workhorse function that performs one step of the
+insertion. If the tuple fits, it inserts it to the given page, otherwise
+it splits the page, and constructs the new downlink tuples for the split
+pages. The caller must then call gistplacetopage() on the parent page to
+insert the downlink tuples. The parent page that holds the downlink to
+the child might have migrated as a result of concurrent splits of the
+parent, gistfindCorrectParent() is used to find the parent page.
+
+Splitting the root page works slightly differently. At root split,
+gistplacetopage() allocates the new child pages and replaces the old root
+page with the new root containing downlinks to the new children, all in one
+operation.
+
+
+findPath is a subroutine of findParent, used when the correct parent page
+can't be found by following the rightlinks at the parent level:
findPath( stack item )
push stack, [root, 0, 0] // page, LSN, parent
@@ -165,9 +193,13 @@ findPath( stack item )
pop stack
end
+
+gistFindCorrectParent is used to re-find the parent of a page during
+insertion. It might have migrated to the right since we traversed down the
+tree because of page splits.
+
findParent( stack item )
parent = item->parent
- latch( parent->page, X-mode )
if ( parent->page->lsn != parent->lsn )
while(true)
search parent tuple on parent->page, if found the return
@@ -181,9 +213,13 @@ findParent( stack item )
end
newstack = findPath( item->parent )
replace part of stack to new one
+ latch( parent->page, X-mode )
return findParent( item )
end
+pageSplit function decides how to distribute keys to the new pages after
+page split:
+
pageSplit(page, allkeys)
(lkeys, rkeys) = pickSplit( allkeys )
if ( page is root )
@@ -204,39 +240,44 @@ pageSplit(page, allkeys)
return newkeys
-placetopage(page, keysarray)
- if ( no space left on page )
- keysarray = pageSplit(page, [ extract_keys(page), keysarray])
- last page in chain gets old NSN,
- original and others - new NSN equals to LSN
- if ( page is root )
- make new root with keysarray
- end
- else
- put keysarray on page
- if ( length of keysarray > 1 )
- keysarray = [ union(keysarray) ]
- end
- end
-insert(new-key)
- stack = findLeaf(new-key)
- keysarray = [new-key]
- ptr = top of stack
- while(true)
- findParent( ptr ) //findParent latches parent page
- keysarray = placetopage(ptr->page, keysarray)
- unlatch( ptr->page )
- pop stack;
- ptr = top of stack
- if (length of keysarray == 1)
- newboundingkey = union(oldboundingkey, keysarray)
- if (newboundingkey == oldboundingkey)
- unlatch ptr->page
- break loop
- end
- end
- end
+Concurrency control
+-------------------
+As a rule of thumb, if you need to hold a lock on multiple pages at the
+same time, the locks should be acquired in the following order: child page
+before parent, and left-to-right at the same level. Always acquiring the
+locks in the same order avoids deadlocks.
+
+The search algorithm only looks at and locks one page at a time. Consequently
+there's a race condition between a search and a page split. A page split
+happens in two phases: 1. The page is split 2. The downlink is inserted to the
+parent. If a search looks at the parent page between those steps, before the
+downlink is inserted, it will still find the new right half by following the
+rightlink on the left half. But it must not follow the rightlink if it saw the
+downlink in the parent, or the page will be visited twice!
+
+A split initially marks the left page with the F_FOLLOW_RIGHT flag. If a scan
+sees that flag set, it knows that the right page is missing the downlink, and
+should be visited too. When split inserts the downlink to the parent, it
+clears the F_FOLLOW_RIGHT flag in the child, and sets the NSN field in the
+child page header to match the LSN of the insertion on the parent. If the
+F_FOLLOW_RIGHT flag is not set, a scan compares the NSN on the child and the
+LSN it saw in the parent. If NSN < LSN, the scan looked at the parent page
+before the downlink was inserted, so it should follow the rightlink. Otherwise
+the scan saw the downlink in the parent page, and will/did follow that as
+usual.
+
+A scan can't normally see a page with the F_FOLLOW_RIGHT flag set, because
+a page split keeps the child pages locked until the downlink has been inserted
+to the parent and the flag cleared again. But if a crash happens in the middle
+of a page split, before the downlinks are inserted into the parent, that will
+leave a page with F_FOLLOW_RIGHT in the tree. Scans handle that just fine,
+but we'll eventually want to fix that for performance reasons. And more
+importantly, dealing with pages with missing downlink pointers in the parent
+would complicate the insertion algorithm. So when an insertion sees a page
+with F_FOLLOW_RIGHT set, it immediately tries to bring the split that
+crashed in the middle to completion by adding the downlink in the parent.
+
Authors:
Teodor Sigaev <teodor@sigaev.ru>