diff options
Diffstat (limited to 'src/backend/access/gist/README')
-rw-r--r-- | src/backend/access/gist/README | 181 |
1 files changed, 111 insertions, 70 deletions
diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README index 03069f02685..2d78dcb0dfa 100644 --- a/src/backend/access/gist/README +++ b/src/backend/access/gist/README @@ -108,43 +108,71 @@ Penalty is used for choosing a subtree to insert; method PickSplit is used for the node splitting algorithm; method Union is used for propagating changes upward to maintain the tree properties. -NOTICE: We modified original INSERT algorithm for performance reason. In -particularly, it is now a single-pass algorithm. - -Function findLeaf is used to identify subtree for insertion. Page, in which -insertion is proceeded, is locked as well as its parent page. Functions -findParent and findPath are used to find parent pages, which could be changed -because of concurrent access. Function pageSplit is recurrent and could split -page by more than 2 pages, which could be necessary if keys have different -lengths or more than one key are inserted (in such situation, user defined -function pickSplit cannot guarantee free space on page). - -findLeaf(new-key) - push(stack, [root, 0]) //page, LSN - while(true) - ptr = top of stack - latch( ptr->page, S-mode ) - ptr->lsn = ptr->page->lsn - if ( exists ptr->parent AND ptr->parent->lsn < ptr->page->nsn ) - unlatch( ptr->page ) - pop stack - else if ( ptr->page is not leaf ) - push( stack, [get_best_child(ptr->page, new-key), 0] ) - unlatch( ptr->page ) - else - unlatch( ptr->page ) - latch( ptr->page, X-mode ) - if ( ptr->page is not leaf ) - //the only root page can become a non-leaf - unlatch( ptr->page ) - else if ( ptr->parent->lsn < ptr->page->nsn ) - unlatch( ptr->page ) - pop stack - else - return stack - end - end - end +To insert a tuple, we first have to find a suitable leaf page to insert to. +The algorithm walks down the tree, starting from the root, along the path +of smallest Penalty. At each step: + +1. Has this page been split since we looked at the parent? If so, it's +possible that we should be inserting to the other half instead, so retreat +back to the parent. +2. If this is a leaf node, we've found our target node. +3. Otherwise use Penalty to pick a new target subtree. +4. Check the key representing the target subtree. If it doesn't already cover +the key we're inserting, replace it with the Union of the old downlink key +and the key being inserted. (Actually, we always call Union, and just skip +the replacement if the Unioned key is the same as the existing key) +5. Replacing the key in step 4 might cause the page to be split. In that case, +propagate the change upwards and restart the algorithm from the first parent +that didn't need to be split. +6. Walk down to the target subtree, and goto 1. + +This differs from the insertion algorithm in the original paper. In the +original paper, you first walk down the tree until you reach a leaf page, and +then you adjust the downlink in the parent, and propagating the adjustment up, +all the way up to the root in the worst case. But we adjust the downlinks to +cover the new key already when we walk down, so that when we reach the leaf +page, we don't need to update the parents anymore, except to insert the +downlinks if we have to split the page. This makes crash recovery simpler: +after inserting a key to the page, the tree is immediately self-consistent +without having to update the parents. Even if we split a page and crash before +inserting the downlink to the parent, the tree is self-consistent because the +right half of the split is accessible via the rightlink of the left page +(which replaced the original page). + +Note that the algorithm can walk up and down the tree before reaching a leaf +page, if internal pages need to split while adjusting the downlinks for the +new key. Eventually, you should reach the bottom, and proceed with the +insertion of the new tuple. + +Once we've found the target page to insert to, we check if there's room +for the new tuple. If there is, the tuple is inserted, and we're done. +If it doesn't fit, however, the page needs to be split. Note that it is +possible that a page needs to be split into more than two pages, if keys have +different lengths or more than one key is being inserted at a time (which can +happen when inserting downlinks for a page split that resulted in more than +two pages at the lower level). After splitting a page, the parent page needs +to be updated. The downlink for the new page needs to be inserted, and the +downlink for the old page, which became the left half of the split, needs to +be updated to only cover those tuples that stayed on the left page. Inserting +the downlink in the parent can again lead to a page split, recursing up to the +root page in the worst case. + +gistplacetopage is the workhorse function that performs one step of the +insertion. If the tuple fits, it inserts it to the given page, otherwise +it splits the page, and constructs the new downlink tuples for the split +pages. The caller must then call gistplacetopage() on the parent page to +insert the downlink tuples. The parent page that holds the downlink to +the child might have migrated as a result of concurrent splits of the +parent, gistfindCorrectParent() is used to find the parent page. + +Splitting the root page works slightly differently. At root split, +gistplacetopage() allocates the new child pages and replaces the old root +page with the new root containing downlinks to the new children, all in one +operation. + + +findPath is a subroutine of findParent, used when the correct parent page +can't be found by following the rightlinks at the parent level: findPath( stack item ) push stack, [root, 0, 0] // page, LSN, parent @@ -165,9 +193,13 @@ findPath( stack item ) pop stack end + +gistFindCorrectParent is used to re-find the parent of a page during +insertion. It might have migrated to the right since we traversed down the +tree because of page splits. + findParent( stack item ) parent = item->parent - latch( parent->page, X-mode ) if ( parent->page->lsn != parent->lsn ) while(true) search parent tuple on parent->page, if found the return @@ -181,9 +213,13 @@ findParent( stack item ) end newstack = findPath( item->parent ) replace part of stack to new one + latch( parent->page, X-mode ) return findParent( item ) end +pageSplit function decides how to distribute keys to the new pages after +page split: + pageSplit(page, allkeys) (lkeys, rkeys) = pickSplit( allkeys ) if ( page is root ) @@ -204,39 +240,44 @@ pageSplit(page, allkeys) return newkeys -placetopage(page, keysarray) - if ( no space left on page ) - keysarray = pageSplit(page, [ extract_keys(page), keysarray]) - last page in chain gets old NSN, - original and others - new NSN equals to LSN - if ( page is root ) - make new root with keysarray - end - else - put keysarray on page - if ( length of keysarray > 1 ) - keysarray = [ union(keysarray) ] - end - end -insert(new-key) - stack = findLeaf(new-key) - keysarray = [new-key] - ptr = top of stack - while(true) - findParent( ptr ) //findParent latches parent page - keysarray = placetopage(ptr->page, keysarray) - unlatch( ptr->page ) - pop stack; - ptr = top of stack - if (length of keysarray == 1) - newboundingkey = union(oldboundingkey, keysarray) - if (newboundingkey == oldboundingkey) - unlatch ptr->page - break loop - end - end - end +Concurrency control +------------------- +As a rule of thumb, if you need to hold a lock on multiple pages at the +same time, the locks should be acquired in the following order: child page +before parent, and left-to-right at the same level. Always acquiring the +locks in the same order avoids deadlocks. + +The search algorithm only looks at and locks one page at a time. Consequently +there's a race condition between a search and a page split. A page split +happens in two phases: 1. The page is split 2. The downlink is inserted to the +parent. If a search looks at the parent page between those steps, before the +downlink is inserted, it will still find the new right half by following the +rightlink on the left half. But it must not follow the rightlink if it saw the +downlink in the parent, or the page will be visited twice! + +A split initially marks the left page with the F_FOLLOW_RIGHT flag. If a scan +sees that flag set, it knows that the right page is missing the downlink, and +should be visited too. When split inserts the downlink to the parent, it +clears the F_FOLLOW_RIGHT flag in the child, and sets the NSN field in the +child page header to match the LSN of the insertion on the parent. If the +F_FOLLOW_RIGHT flag is not set, a scan compares the NSN on the child and the +LSN it saw in the parent. If NSN < LSN, the scan looked at the parent page +before the downlink was inserted, so it should follow the rightlink. Otherwise +the scan saw the downlink in the parent page, and will/did follow that as +usual. + +A scan can't normally see a page with the F_FOLLOW_RIGHT flag set, because +a page split keeps the child pages locked until the downlink has been inserted +to the parent and the flag cleared again. But if a crash happens in the middle +of a page split, before the downlinks are inserted into the parent, that will +leave a page with F_FOLLOW_RIGHT in the tree. Scans handle that just fine, +but we'll eventually want to fix that for performance reasons. And more +importantly, dealing with pages with missing downlink pointers in the parent +would complicate the insertion algorithm. So when an insertion sees a page +with F_FOLLOW_RIGHT set, it immediately tries to bring the split that +crashed in the middle to completion by adding the downlink in the parent. + Authors: Teodor Sigaev <teodor@sigaev.ru> |