| Commit message (Collapse) | Author | Age |
|
|
|
| |
Backpatch-through: certain files through 9.4
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If you extend a relation, it should count as a block written, not
read (we write a zero-filled block). If you ask for a zero-filled
buffer, it shouldn't be counted as read or written.
Later we might consider counting zero-filled buffers with a separate
counter, if they become more common due to future work.
Author: Thomas Munro
Reviewed-by: Haribabu Kommi, Kyotaro Horiguchi, David Rowley
Discussion: https://postgr.es/m/CAEepm%3D3JytB3KPpvSwXzkY%2Bdwc5zC8P8Lk7Nedkoci81_0E9rA%40mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It seems incorrect to assume that the list of CkptSortItems can never
contain duplicate page numbers: concurrent activity could result in some
page getting dropped from a low-numbered buffer and later loaded into a
high-numbered buffer while BufferSync is scanning the buffer pool.
If that happened, the comparator would give self-inconsistent results,
potentially confusing qsort(). Saving one comparison step is not worth
possibly getting the sort wrong.
So far as I can tell, nothing would actually go wrong given our current
implementation of qsort(). It might get a bit slower than expected
if there were a large number of duplicates of one value, but that's
surely a probability-epsilon case. Still, the comment is wrong,
and if we ever switched to another sort implementation it might be
less forgiving.
In passing, avoid casting away const-ness of the argument pointers;
I've not seen any compiler complaints from that, but it seems likely
that some compilers would not like it.
Back-patch to 9.6 where this code came in, just in case I've underestimated
the possible consequences.
Discussion: https://postgr.es/m/18437.1515607610@sss.pgh.pa.us
|
|
|
|
| |
Backpatch-through: certain files through 9.3
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The point of having separate ResourceOwnerEnlargeFoo and
ResourceOwnerRememberFoo functions is so that resource allocation
can happen in between. Doing it in some other order is just wrong.
OpenTemporaryFile() did open(), enlarge, remember, which would leak the
open file if the enlarge step ran out of memory. Because fd.c has its own
layer of resource-remembering, the consequences look like they'd be limited
to an intratransaction FD leak, but it's still not good.
IncrBufferRefCount() did enlarge, remember, incr-refcount, which would blow
up if the incr-refcount step ever failed. It was safe enough when written,
but since the introduction of PrivateRefCountHash, I think the assumption
that no error could happen there is pretty shaky.
The odds of real problems from either bug are probably small, but still,
back-patch to supported branches.
Thomas Munro and Tom Lane, per a comment from Andres Freund
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The lower case spellings are C and C++ standard and are used in most
parts of the PostgreSQL sources. The upper case spellings are only used
in some files/modules. So standardize on the standard spellings.
The APIs for ICU, Perl, and Windows define their own TRUE and FALSE, so
those are left as is when using those APIs.
In code comments, we use the lower-case spelling for the C concepts and
keep the upper-case spelling for the SQL concepts.
Reviewed-by: Michael Paquier <michael.paquier@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Periodically while the server is running, and at shutdown, write out a
list of blocks in shared buffers. When the server reaches consistency
-- unfortunatey, we can't do it before that point without breaking
things -- reload those blocks into any still-unused shared buffers.
Mithun Cy and Robert Haas, reviewed and tested by Beena Emerson,
Amit Kapila, Jim Nasby, and Rafia Sabih.
Discussion: http://postgr.es/m/CAD__OugubOs1Vy7kgF6xTjmEqTR4CrGAv8w+ZbaY_+MZeitukw@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Don't move parenthesized lines to the left, even if that means they
flow past the right margin.
By default, BSD indent lines up statement continuation lines that are
within parentheses so that they start just to the right of the preceding
left parenthesis. However, traditionally, if that resulted in the
continuation line extending to the right of the desired right margin,
then indent would push it left just far enough to not overrun the margin,
if it could do so without making the continuation line start to the left of
the current statement indent. That makes for a weird mix of indentations
unless one has been completely rigid about never violating the 80-column
limit.
This behavior has been pretty universally panned by Postgres developers.
Hence, disable it with indent's new -lpl switch, so that parenthesized
lines are always lined up with the preceding left paren.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Change pg_bsd_indent to follow upstream rules for placement of comments
to the right of code, and remove pgindent hack that caused comments
following #endif to not obey the general rule.
Commit e3860ffa4dd0dad0dd9eea4be9cc1412373a8c89 wasn't actually using
the published version of pg_bsd_indent, but a hacked-up version that
tried to minimize the amount of movement of comments to the right of
code. The situation of interest is where such a comment has to be
moved to the right of its default placement at column 33 because there's
code there. BSD indent has always moved right in units of tab stops
in such cases --- but in the previous incarnation, indent was working
in 8-space tab stops, while now it knows we use 4-space tabs. So the
net result is that in about half the cases, such comments are placed
one tab stop left of before. This is better all around: it leaves
more room on the line for comment text, and it means that in such
cases the comment uniformly starts at the next 4-space tab stop after
the code, rather than sometimes one and sometimes two tabs after.
Also, ensure that comments following #endif are indented the same
as comments following other preprocessor commands such as #else.
That inconsistency turns out to have been self-inflicted damage
from a poorly-thought-through post-indent "fixup" in pgindent.
This patch is much less interesting than the first round of indent
changes, but also bulkier, so I thought it best to separate the effects.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The new indent version includes numerous fixes thanks to Piotr Stefaniak.
The main changes visible in this commit are:
* Nicer formatting of function-pointer declarations.
* No longer unexpectedly removes spaces in expressions using casts,
sizeof, or offsetof.
* No longer wants to add a space in "struct structname *varname", as
well as some similar cases for const- or volatile-qualified pointers.
* Declarations using PG_USED_FOR_ASSERTS_ONLY are formatted more nicely.
* Fixes bug where comments following declarations were sometimes placed
with no space separating them from the code.
* Fixes some odd decisions for comments following case labels.
* Fixes some cases where comments following code were indented to less
than the expected column 33.
On the less good side, it now tends to put more whitespace around typedef
names that are not listed in typedefs.list. This might encourage us to
put more effort into typedef name collection; it's not really a bug in
indent itself.
There are more changes coming after this round, having to do with comment
indentation and alignment of lines appearing within parentheses. I wanted
to limit the size of the diffs to something that could be reviewed without
one's eyes completely glazing over, so it seemed better to split up the
changes as much as practical.
Discussion: https://postgr.es/m/E1dAmxK-0006EE-1r@gemulon.postgresql.org
Discussion: https://postgr.es/m/30527.1495162840@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
| |
this makes buffer access strategy have no effect.
Change was a part of commit 48354581a49c30f5757c203415aa8412d85b0f70 during 9.6
release cycle, so backpath to 9.6
Reported-by: Jim Nasby
Author: Alexander Korotkov
Reviewed-by: Jim Nasby, Andres Freund
https://commitfest.postgresql.org/13/1029/
|
|
|
|
|
|
|
|
|
|
|
| |
Names containing the letters "log" sometimes confuse users into
believing that only non-critical data is present. It is hoped
this renaming will discourage ill-considered removals of transaction
status data.
Michael Paquier
Discussion: http://postgr.es/m/CA+Tgmoa9xFQyjRZupbdEFuwUerFTvC6HjZq1ud6GYragGDFFgA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This could result in corruption of the init fork of an unlogged index
if the ambuildempty routine for that index used shared buffers to
create the init fork, which was true for brin, gin, gist, and hash
indexes.
Patch by me, based on an earlier patch by Michael Paquier, who also
reviewed this one. This also incorporates an idea from Artur
Zakirov.
Discussion: http://postgr.es/m/CACYUyc8yccE4xfxhqxfh_Mh38j7dRFuxfaK1p6dSNAEUakxUyQ@mail.gmail.com
|
|
|
|
|
|
|
|
| |
The interface for the function was changed in
d72731a70450b5e7084991b9caa15cb58a2820df but the comments of the
function was not updated.
Patch by Yugo Nagata.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
array_base and array_stride were added so that we could identify the
offset of an LWLock within a tranche, but this facility is only very
marginally used apart from the main tranche. So, give every lock in
the main tranche its own tranche ID and get rid of array_base,
array_stride, and all that's attached. For debugging facilities
(Trace_lwlocks and LWLOCK_STATS) print the pointer address of the
LWLock using %p instead of the offset. This is arguably more useful,
and certainly a lot cheaper. Drop the offset-within-tranche from
the information reported to dtrace and from one can't-happen message
inside lwlock.c.
The main user-visible impact of this change is that pg_stat_activity
will now report all waits for LWLocks as "LWLock" rather than
reporting some as "LWLockTranche" and others as "LWLockNamed".
The main motivation for this change is that the need to specify an
array_base and an array_stride is awkward for parallel query. There
is only a very limited supply of tranche IDs so we can't just keep
allocating new ones, and if we try to use the same tranche IDs every
time then we run into trouble when multiple parallel contexts are
use simultaneously. So if we didn't get rid of this mechanism we'd
have to make it even more complicated. By simplifying it in this
way, we instead reduce the size of the generated code for lwlock.c
by about 5%.
Discussion: http://postgr.es/m/CA+TgmoYsFn6NUW1x0AZtupJGUAs1UDY4dJtCN47_Q6D0sP80PA@mail.gmail.com
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
LockBufferForCleanup() acquires a cleanup lock unconditionally, and
ConditionalLockBufferForCleanup() acquires a cleanup lock if it is
possible to do so without waiting; this patch adds a new API,
IsBufferCleanupOK(), which tests whether an exclusive lock already
held happens to be a cleanup lock. This is possible because a cleanup
lock simply means an exclusive lock plus the assurance any other pins
on the buffer are newer than our own pin. Therefore, just as the
existing functions decide that the exclusive lock that they've just
taken is a cleanup lock if they observe the pin count to be 1, this
new function allows us to observe that the pin count is 1 on a buffer
we've already locked.
This is useful in situations where a backend definitely wishes to
modify the buffer and also wishes to perform cleanup operations if
possible. The patch to eliminate heavyweight locking by hash indexes
uses this, and it may have other applications as well.
Amit Kapila, per a suggestion from me. Some comment adjustments by me
as well.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I somehow had assumed that in the spinlock (in turn possibly using
semaphores) based fallback atomics implementation 32 bit writes could be
done without a lock. As far as the write goes that's correct, since
postgres supports only platforms with single-copy atomicity for aligned
32bit writes. But writing without holding the spinlock breaks
read-modify-write operations like pg_atomic_compare_exchange_u32(),
since they'll potentially "miss" a concurrent write, which can't happen
in actual hardware implementations.
In 9.6+ when using the fallback atomics implementation this could lead
to buffer header locks not being properly marked as released, and
potentially some related state corruption. I don't see a related danger
in 9.5 (earliest release with the API), because pg_atomic_write_u32()
wasn't used in a concurrent manner there.
The state variable of local buffers, before this change, were
manipulated using pg_atomic_write_u32(), to avoid unnecessary
synchronization overhead. As that'd not be the case anymore, introduce
and use pg_atomic_unlocked_write_u32(), which does not correctly
interact with RMW operations.
This bug only caused issues when postgres is compiled on platforms
without atomics support (i.e. no common new platform), or when compiled
with --disable-atomics, which explains why this wasn't noticed in
testing.
Reported-By: Tom Lane
Discussion: <14947.1475690465@sss.pgh.pa.us>
Backpatch: 9.5-, where the atomic operations API was introduced.
|
|
|
|
|
|
|
|
| |
Windows apparently has a constant named WAIT_TIMEOUT, and some of these
other names are pretty generic, too. Insert "PG_" at the front of each
name in order to disambiguate.
Michael Paquier
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
WaitLatch, WaitLatchOrSocket, and WaitEventSetWait now taken an
additional wait_event_info parameter; legal values are defined in
pgstat.h. This makes it possible to uniquely identify every point in
the core code where we are waiting for a latch; extensions can pass
WAIT_EXTENSION.
Because latches were the major wait primitive not previously covered
by this patch, it is now possible to see information in
pg_stat_activity on a large number of important wait events not
previously addressed, such as ClientRead, ClientWrite, and SyncRep.
Unfortunately, many of the wait events added by this patch will fail
to appear in pg_stat_activity because they're only used in background
processes which don't currently appear in pg_stat_activity. We should
fix this either by creating a separate view for such information, or
else by deciding to include them in pg_stat_activity after all.
Michael Paquier and Robert Haas, reviewed by Alexander Korotkov and
Thomas Munro.
|
|
|
|
|
|
|
| |
Tests whether my process holds a lock in given mode.
Add initial usage in MarkBufferDirty().
Thomas Munro
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I found that half a dozen (nearly 5%) of our AllocSetContextCreate calls
had typos in the context-sizing parameters. While none of these led to
especially significant problems, they did create minor inefficiencies,
and it's now clear that expecting people to copy-and-paste those calls
accurately is not a great idea. Let's reduce the risk of future errors
by introducing single macros that encapsulate the common use-cases.
Three such macros are enough to cover all but two special-purpose contexts;
those two calls can be left as-is, I think.
While this patch doesn't in itself improve matters for third-party
extensions, it doesn't break anything for them either, and they can
gradually adopt the simplified notation over time.
In passing, change TopMemoryContext to use the default allocation
parameters. Formerly it could only be extended 8K at a time. That was
probably reasonable when this code was written; but nowadays we create
many more contexts than we did then, so that it's not unusual to have a
couple hundred K in TopMemoryContext, even without considering various
dubious code that sticks other things there. There seems no good reason
not to let it use growing blocks like most other contexts.
Back-patch to 9.6, mostly because that's still close enough to HEAD that
it's easy to do so, and keeping the branches in sync can be expected to
avoid some future back-patching pain. The bugs fixed by these changes
don't seem to be significant enough to justify fixing them further back.
Discussion: <21072.1472321324@sss.pgh.pa.us>
|
|
|
|
|
| |
Author: Masahiko Sawada
Discussion: CAD21AoBD=Of1OzL90Xx4Q-3j=-2q7=S87cs75HfutE=eCday2w@mail.gmail.com
|
|
|
|
|
|
|
| |
Commit b8fd1a09f3 renamed XLOG_HINT to XLOG_FPI, but neglected two
places.
Backpatch to 9.3, like that commit.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since indexes are created without valid LSNs, an index created
while a snapshot older than old_snapshot_threshold existed could
cause queries to return incorrect results when those old snapshots
were used, if any relevant rows had been subject to early pruning
before the index was built. Prevent usage of a newly created index
until all such snapshots are released, for relations where this can
happen.
Questions about the interaction of "snapshot too old" with index
creation were initially raised by Andres Freund.
Reviewed by Robert Haas.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Transmit the leader's temp-namespace state to workers. This is important
because without it, the workers do not really have the same search path
as the leader. For example, there is no good reason (and no extant code
either) to prevent a worker from executing a temp function that the
leader created previously; but as things stood it would fail to find the
temp function, and then either fail or execute the wrong function entirely.
We still prohibit a worker from creating a temp namespace on its own.
In effect, a worker can only see the session's temp namespace if the leader
had created it before starting the worker, which seems like the right
semantics.
Also, transmit the leader's BackendId to workers, and arrange for workers
to use that when determining the physical file path of a temp relation
belonging to their session. While the original intent was to prevent such
accesses entirely, there were a number of holes in that, notably in places
like dbsize.c which assume they can safely access temp rels of other
sessions anyway. We might as well get this right, as a small down payment
on someday allowing workers to access the leader's temp tables. (With
this change, directly using "MyBackendId" as a relation or buffer backend
ID is deprecated; you should use BackendIdForTempRelations() instead.
I left a couple of such uses alone though, as they're not going to be
reachable in parallel workers until we do something about localbuf.c.)
Move the thou-shalt-not-access-thy-leader's-temp-tables prohibition down
into localbuf.c, which is where it actually matters, instead of having it
in relation_open(). This amounts to recognizing that access to temp
tables' catalog entries is perfectly safe in a worker, it's only the data
in local buffers that is problematic.
Having done all that, we can get rid of the test in has_parallel_hazard()
that says that use of a temp table's rowtype is unsafe in parallel workers.
That test was unduly expensive, and if we really did need such a
prohibition, that was not even close to being a bulletproof guard for it.
(For example, any user-defined function executed in a parallel worker
might have attempted such access.)
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Mostly these are just comments but there are a few in documentation
and a handful in code and tests. Hopefully this doesn't cause too much
unnecessary pain for backpatching. I relented from some of the most
common like "thru" for that reason. The rest don't seem numerous
enough to cause problems.
Thanks to Kevin Lyda's tool https://pypi.python.org/pypi/misspellings
|
|
|
|
|
|
|
|
| |
Commit 1aba62ec moved the range check of that option form guc.c into
bufmgr.c, but introduced a bug by changing a >= 0.0 to > 0.0, which made
the value 0 no longer accepted. Put it back.
Reported by Jeff Janes, diagnosed by Tom Lane
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
That commit increased all shared memory allocations to the next higher
multiple of PG_CACHE_LINE_SIZE, but it didn't ensure that allocation
started on a cache line boundary. It also failed to remove a couple
other pieces of now-useless code.
BUFFERALIGN() is perhaps obsolete at this point, and likely should be
removed at some point, too, but that seems like it can be left to a
future cleanup.
Mistakes all pointed out by Andres Freund. The patch is mine, with
a few extra assertions which I adopted from his version of this fix.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Even with old_snapshot_threshold = -1 (which disables the "snapshot
too old" feature), performance regressions were seen at moderate to
high concurrency. For example, a one-socket, four-core system
running 200 connections at saturation could see up to a 2.3%
regression, with larger regressions possible on NUMA machines.
By inlining the early (smaller, faster) tests in the
TestForOldSnapshot() function, the i7 case dropped to a 0.2%
regression, which could easily just be noise, and is clearly an
improvement. Further testing will show whether more is needed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The reverted changes were intended to force a choice of whether any
newly-added BufferGetPage() calls needed to be accompanied by a
test of the snapshot age, to support the "snapshot too old"
feature. Such an accompanying test is needed in about 7% of the
cases, where the page is being used as part of a scan rather than
positioning for other purposes (such as DML or vacuuming). The
additional effort required for back-patching, and the doubt whether
the intended benefit would really be there, have indicated it is
best just to rely on developers to do the right thing based on
comments and existing usage, as we do with many other conventions.
This change should have little or no effect on generated executable
code.
Motivated by the back-patching pain of Tom Lane and Robert Haas
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Coverity complained that oldPartitionLock was possibly dereferenced after
having been set to NULL. That actually can't happen, because we'd only use
it if (oldFlags & BM_TAG_VALID) is true. But nonetheless Coverity is
justified in complaining, because at line 1275 we actually overwrite
oldFlags, and then still expect its BM_TAG_VALID bit to be a safe guide to
whether to release the oldPartitionLock. Thus, the code would be incorrect
if someone else had changed the buffer's BM_TAG_VALID flag meanwhile.
That should not happen, since we hold pin on the buffer throughout this
sequence, but it's starting to look like a rather shaky chain of logic.
And there's no need for such assumptions, because we can simply replace
the (oldFlags & BM_TAG_VALID) tests with (oldPartitionLock != NULL),
which has identical results and makes it plain to all comers that we don't
dereference a null pointer. A small side benefit is that the range of
liveness of oldFlags is greatly reduced, possibly allowing the compiler
to save a register.
This is just cleanup, not an actual bug fix, so there seems no need
for a back-patch.
|
|
|
|
|
|
|
|
| |
pg_xlogdump includes bufmgr.h. With a compiler that emits code for
static inline functions even when they're unreferenced, that leads
to unresolved external references in the new static-inline version
of BufferGetPage(). So hide it with #ifndef FRONTEND, as we've done
for similar issues elsewhere. Per buildfarm member pademelon.
|
|
|
|
|
|
|
|
|
|
|
| |
My previous attempt at doing so, in 80abbeba23, was not sufficient. While that
fixed the problem for bufmgr.c and lwlock.c , s_lock.c still has non-constant
expressions in the struct initializer, because the file/line/function
information comes from the caller of s_lock().
Give up on using a macro, and use a static inline instead.
Discussion: 4369.1460435533@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current definition of init_spin_delay (introduced recently in
48354581a) wasn't C89 compliant. It's not legal to refer to refer to
non-constant expressions, and the ptr argument was one. This, as
reported by Tom, lead to a failure on buildfarm animal pademelon.
The pointer, especially on system systems with ASLR, isn't super helpful
anyway, though. So instead of making init_spin_delay into an inline
function, make s_lock_stuck() report the function name in addition to
file:line and change init_spin_delay() accordingly. While not a direct
replacement, the function name is likely more useful anyway (line
numbers are often hard to interpret in third party reports).
This also fixes what file/line number is reported for waits via
s_lock().
As PG_FUNCNAME_MACRO is now used outside of elog.h, move it to c.h.
Reported-By: Tom Lane
Discussion: 4369.1460435533@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The recent patch to make Pin/UnpinBuffer lockfree in the hot
path (48354581a), accidentally used pg_atomic_fetch_or_u32() in
MarkLocalBufferDirty(). Other code operating on local buffers was
careful to only use pg_atomic_read/write_u32 which just read/write from
memory; to avoid unnecessary overhead.
On its own that'd just make MarkLocalBufferDirty() slightly less
efficient, but in addition InitLocalBuffers() doesn't call
pg_atomic_init_u32() - thus the spinlock fallback for the atomic
operations isn't initialized. That in turn caused, as reported by Tom,
buildfarm animal gaur to fail. As those errors are actually useful
against this type of error, continue to omit - intentionally this time -
initialization of the atomic variable.
In addition, add an explicit note about only using pg_atomic_read/write
on local buffers's state to BufferDesc's description.
Reported-By: Tom Lane
Discussion: 1881.1460431476@sss.pgh.pa.us
|
|
|
|
|
|
|
|
|
|
|
|
| |
I was initially concerned that the some of the hundreds of
references to BufferGetPage() where the literal
BGP_NO_SNAPSHOT_TEST were passed might not optimize as well as a
macro, leading to some hard-to-find performance regressions in
corner cases. Inspection of disassembled code has shown identical
code at all inspected locations, and the size difference doesn't
amount to even one byte per such call. So make it readable.
Per gripes from Álvaro Herrera and Tom Lane
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Pinning/Unpinning a buffer is a very frequent operation; especially in
read-mostly cache resident workloads. Benchmarking shows that in various
scenarios the spinlock protecting a buffer header's state becomes a
significant bottleneck. The problem can be reproduced with pgbench -S on
larger machines, but can be considerably worse for queries which touch
the same buffers over and over at a high frequency (e.g. nested loops
over a small inner table).
To allow atomic operations to be used, cram BufferDesc's flags,
usage_count, buf_hdr_lock, refcount into a single 32bit atomic variable;
that allows to manipulate them together using 32bit compare-and-swap
operations. This requires reducing MAX_BACKENDS to 2^18-1 (which could
be lifted by using a 64bit field, but it's not a realistic configuration
atm).
As not all operations can easily implemented in a lockfree manner,
implement the previous buf_hdr_lock via a flag bit in the atomic
variable. That way we can continue to lock the header in places where
it's needed, but can get away without acquiring it in the more frequent
hot-paths. There's some additional operations which can be done without
the lock, but aren't in this patch; but the most important places are
covered.
As bufmgr.c now essentially re-implements spinlocks, abstract the delay
logic from s_lock.c into something more generic. It now has already two
users, and more are coming up; there's a follupw patch for lwlock.c at
least.
This patch is based on a proof-of-concept written by me, which Alexander
Korotkov made into a fully working patch; the committed version is again
revised by me. Benchmarking and testing has, amongst others, been
provided by Dilip Kumar, Alexander Korotkov, Robert Haas.
On a large x86 system improvements for readonly pgbench, with a high
client count, of a factor of 8 have been observed.
Author: Alexander Korotkov and Andres Freund
Discussion: 2400449.GjM57CE0Yg@dinodell
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This feature is controlled by a new old_snapshot_threshold GUC. A
value of -1 disables the feature, and that is the default. The
value of 0 is just intended for testing. Above that it is the
number of minutes a snapshot can reach before pruning and vacuum
are allowed to remove dead tuples which the snapshot would
otherwise protect. The xmin associated with a transaction ID does
still protect dead tuples. A connection which is using an "old"
snapshot does not get an error unless it accesses a page modified
recently enough that it might not be able to produce accurate
results.
This is similar to the Oracle feature, and we use the same SQLSTATE
and error message for compatibility.
|
|
|
|
|
|
|
|
|
|
|
| |
This patch is a no-op patch which is intended to reduce the chances
of failures of omission once the functional part of the "snapshot
too old" patch goes in. It adds parameters for snapshot, relation,
and an enum to specify whether the snapshot age check needs to be
done for the page at this point. This initial patch passes NULL
for the first two new parameters and BGP_NO_SNAPSHOT_TEST for the
third. The follow-on patch will change the places where the test
needs to be made.
|
| |
|
|
|
|
| |
Oskari Saarenmaa
|
|
|
|
|
| |
Reported-By: Peter Eisentraut
Discussion: 56E2239E.1050607@gmx.net
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Up to now checkpoints were written in the order they're in the
BufferDescriptors. That's nearly random in a lot of cases, which
performs badly on rotating media, but even on SSDs it causes slowdowns.
To avoid that, sort checkpoints before writing them out. We currently
sort by tablespace, relfilenode, fork and block number.
One of the major reasons that previously wasn't done, was fear of
imbalance between tablespaces. To address that balance writes between
tablespaces.
The other prime concern was that the relatively large allocation to sort
the buffers in might fail, preventing checkpoints from happening. Thus
pre-allocate the required memory in shared memory, at server startup.
This particularly makes it more efficient to have checkpoint flushing
enabled, because that'll often result in a lot of writes that can be
coalesced into one flush.
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently writes to the main data files of postgres all go through the
OS page cache. This means that some operating systems can end up
collecting a large number of dirty buffers in their respective page
caches. When these dirty buffers are flushed to storage rapidly, be it
because of fsync(), timeouts, or dirty ratios, latency for other reads
and writes can increase massively. This is the primary reason for
regular massive stalls observed in real world scenarios and artificial
benchmarks; on rotating disks stalls on the order of hundreds of seconds
have been observed.
On linux it is possible to control this by reducing the global dirty
limits significantly, reducing the above problem. But global
configuration is rather problematic because it'll affect other
applications; also PostgreSQL itself doesn't always generally want this
behavior, e.g. for temporary files it's undesirable.
Several operating systems allow some control over the kernel page
cache. Linux has sync_file_range(2), several posix systems have msync(2)
and posix_fadvise(2). sync_file_range(2) is preferable because it
requires no special setup, whereas msync() requires the to-be-flushed
range to be mmap'ed. For the purpose of flushing dirty data
posix_fadvise(2) is the worst alternative, as flushing dirty data is
just a side-effect of POSIX_FADV_DONTNEED, which also removes the pages
from the page cache. Thus the feature is enabled by default only on
linux, but can be enabled on all systems that have any of the above
APIs.
While desirable and likely possible this patch does not contain an
implementation for windows.
With the infrastructure added, writes made via checkpointer, bgwriter
and normal user backends can be flushed after a configurable number of
writes. Each of these sources of writes controlled by a separate GUC,
checkpointer_flush_after, bgwriter_flush_after and backend_flush_after
respectively; they're separate because the number of flushes that are
good are separate, and because the performance considerations of
controlled flushing for each of these are different.
A later patch will add checkpoint sorting - after that flushes from the
ckeckpoint will almost always be desirable. Bgwriter flushes are most of
the time going to be random, which are slow on lots of storage hardware.
Flushing in backends works well if the storage and bgwriter can keep up,
but if not it can have negative consequences. This patch is likely to
have negative performance consequences without checkpoint sorting, but
unfortunately so has sorting without flush control.
Discussion: alpine.DEB.2.10.1506011320000.28433@sto
Author: Fabien Coelho and Andres Freund
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a process is waiting for a heavyweight lock, we will now indicate
the type of heavyweight lock for which it is waiting. Also, you can
now see when a process is waiting for a lightweight lock - in which
case we will indicate the individual lock name or the tranche, as
appropriate - or for a buffer pin.
Amit Kapila, Ildus Kurbangaliev, reviewed by me. Lots of helpful
discussion and suggestions by many others, including Alexander
Korotkov, Vladimir Borodin, and many others.
|
|
|
|
|
|
|
|
|
|
|
|
| |
In 4b4b680c3 I accidentally used sizeof(PrivateRefCountArray) instead of
sizeof(PrivateRefCountEntry) when creating the refcount overflow
hashtable. As the former is bigger than the latter, this luckily only
resulted in a slightly increased memory usage when many buffers are
pinned in a backend.
Reported-By: Takashi Horikawa
Discussion: 73FA3881462C614096F815F75628AFCD035A48C3@BPXM01GP.gisp.nec.co.jp
Backpatch: 9.5, where thew new ref count infrastructure was introduced
|
|
|
|
|
|
| |
Previously, we had a mix of styles.
Amit Kapila
|
|
|
|
|
|
| |
This reverts commit 3971f64843b02e4a55d854156bd53e46a0588e45 and a
couple of followon debugging commits; I think we've learned what we can
from them.
|